RFR: 8267532: Try/catch block not optimized as expected
Jorn Vernee
jvernee at openjdk.org
Wed Nov 1 13:48:27 UTC 2023
The issue is essentially that for the Java try-with-resource construct, javac generates multiple calls to `close(`) the resource. One of those calls is inside the hidden exception handler of the try block. The issue for us is that typically the exception handler is never entered (since no exception is thrown), however we don't profile exception handlers at the moment, so the block is not pruned. C2 doesn't inline the `close()` call in the handler due to low call site frequency. As a result, the receiver of that call escapes and can not be scalar replaced, which then leads to a loss in performance.
There has been some discussion on the JBS issue that this could be fixed by profiling catch blocks. And another suggestion that partial escape analysis could help here to prevent the object from escaping. But, I think there are other benefits to being able to prune dead catch blocks, such as general reduction in code size, and other optimizations being possible by dead code being eliminated. So, I've implemented catch block profiling + pruning in this patch.
The implementation is essentially very straightforward: we allocate an extra bit of profiling data for each
exception handler of a method in the `MethodData` for that method (which holds all the profiling
data). Then when looking up the exception handler after an exception is thrown, we mark the
exception handler as entered. When C2 parses the exception handler block, and it sees that it has
never been entered, we emit an uncommon trap instead.
I've also cleaned up the handling of profiling data sections a bit. After adding the extra section of data to MethodData, I was seeing several crashes when ciMethodData was used. The underlying issue seemed to be that the offset of the parameter data was computed based on the total data size - parameter data size (which doesn't work if we add an additional section for exception handler data). I've re-written the code around this a bit to try and prevent issues in the future. Both MethodData and ciMethodData now track offsets of parameter data and exception handler data, and the size of the each data section is derived from the offsets.
Finally, there was an assert firing in `freeze_internal` in `continuationFreezeThaw.cpp`:
assert(monitors_on_stack(current) == ((current->held_monitor_count() - current->jni_monitor_count()) > 0),
"Held monitor count and locks on stack invariant: " INT64_FORMAT " JNI: " INT64_FORMAT, (int64_t)current->held_monitor_count(), (int64_t)current->jni_monitor_count());
This assert relies on `has_monitors` being set for a method, which in itself relies on `monitorenter` and `monitorexit` being parsed. However, if we prune untaken exception handlers, we might not see any `monitorexit`, which is a problem for OSR compilations since then we might also not see any `monitorenter` in that case. After some investigation, it turns out that `ciMethod` already tracks whether monitor bytecodes are being used, so we can just piggyback on that instead of relying on `monitorenter` or `monitorexit` being parsed. We can follow the existing pattern for how `has_reserved_stack_access` is being tracked (which I've done). See https://github.com/openjdk/jdk/pull/16416/commits/a48420681549ac9343f625e1a3a26a737fc4921e https://github.com/openjdk/jdk/pull/16416/commits/a33a905689a056ac053ac34df64541b652747076 and https://github.com/openjdk/jdk/pull/16416/commits/d727df704ea092eedb20517bdd696d82d75b00b2
Benchmark with `-XX:-PruneDeadExceptionHandlers`:
Benchmark Mode Cnt Score Error Units
ResourceScopeCloseMin.confined_close avgt 30 10.458 ± 0.070 ns/op
ResourceScopeCloseMin.confined_close:gc.alloc.rate avgt 30 9480.988 ± 63.335 MB/sec
ResourceScopeCloseMin.confined_close:gc.alloc.rate.norm avgt 30 104.000 ± 0.001 B/op
ResourceScopeCloseMin.confined_close:gc.count avgt 30 119.000 counts
ResourceScopeCloseMin.confined_close:gc.time avgt 30 94.000 ms
ResourceScopeCloseMin.confined_close_notry avgt 30 4.691 ± 0.063 ns/op
ResourceScopeCloseMin.confined_close_notry:gc.alloc.rate avgt 30 11383.693 ± 151.145 MB/sec
ResourceScopeCloseMin.confined_close_notry:gc.alloc.rate.norm avgt 30 56.000 ± 0.001 B/op
ResourceScopeCloseMin.confined_close_notry:gc.count avgt 30 120.000 counts
ResourceScopeCloseMin.confined_close_notry:gc.time avgt 30 104.000 ms
with `-XX:+PruneDeadExceptionHandlers`:
Benchmark Mode Cnt Score Error Units
ResourceScopeCloseMin.confined_close avgt 30 4.563 ± 0.043 ns/op
ResourceScopeCloseMin.confined_close:gc.alloc.rate avgt 30 11702.868 ± 108.816 MB/sec
ResourceScopeCloseMin.confined_close:gc.alloc.rate.norm avgt 30 56.000 ± 0.001 B/op
ResourceScopeCloseMin.confined_close:gc.count avgt 30 121.000 counts
ResourceScopeCloseMin.confined_close:gc.time avgt 30 93.000 ms
ResourceScopeCloseMin.confined_close_notry avgt 30 4.601 ± 0.054 ns/op
ResourceScopeCloseMin.confined_close_notry:gc.alloc.rate avgt 30 11605.391 ± 134.000 MB/sec
ResourceScopeCloseMin.confined_close_notry:gc.alloc.rate.norm avgt 30 56.000 ± 0.001 B/op
ResourceScopeCloseMin.confined_close_notry:gc.count avgt 30 121.000 counts
ResourceScopeCloseMin.confined_close_notry:gc.time avgt 30 101.000 ms
Note that with the optimization turned on, timing and `gc.alloc.rate.norm` is ~equal.
I also noticed through other experiments that C2's ability to inline improves, due to `inline_instructions_size` being reduced for methods with untaken exception handlers, which might bring the size under `InlineSmallCode`, and allow the method to be inlined again.
Testing : tier 1-6 (ongoing). Local run of `hotspot_compiler` suite with `-XX:+DeoptimizeALot` and with `-XX:+StressPrunedExceptionHandlers`.
-------------
Commit messages:
- use TWR in foreign benchmarks
- reduce duplication
- beef up test a bit
- improve StressPrunedExceptionHandlers comment
- Revert "Revert "add exception handler pruning stress flags""
- also set has_monitors when callee is synchronized
- Revert "add exception handler pruning stress flags"
- use has_monitor_bytecodes() in c1 as well
- use ciMethod::has_monitor_bytecodes
- add exception handler pruning stress flags
- ... and 18 more: https://git.openjdk.org/jdk/compare/6f352740...059e306c
Changes: https://git.openjdk.org/jdk/pull/16416/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16416&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8267532
Stats: 598 lines in 23 files changed: 498 ins; 34 del; 66 mod
Patch: https://git.openjdk.org/jdk/pull/16416.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/16416/head:pull/16416
PR: https://git.openjdk.org/jdk/pull/16416
More information about the hotspot-dev
mailing list