RFR: 8335480: Only deoptimize threads if needed when closing shared arena [v3]
Jorn Vernee
jvernee at openjdk.org
Mon Jul 15 11:49:53 UTC 2024
On Mon, 15 Jul 2024 11:33:30 GMT, Jorn Vernee <jvernee at openjdk.org> wrote:
>> This PR limits the number of cases in which we deoptimize frames when closing a shared Arena. The initial intent of this was to improve the performance of shared arena closure in cases where a lot of threads are accessing and closing shared arenas at the same time (see attached benchmark), but unfortunately even disabling deoptimization altogether does not have a great effect on that benchmark.
>>
>> Nevertheless, I think the extra logging/testing/benchmark code, and comments I've written, together with reducing the number of cases where we deoptimize (which makes it clearer exactly why we need to deoptimize in the first place), will be useful going forward. So, I've a create this PR out of them.
>>
>> In this PR:
>> - Deoptimizing is now only done in cases where it's needed, instead of always. Which is in cases where we are not inside an `@Scoped` method, but are inside a compiled frame that has a scoped access somewhere inside of it.
>> - I've separated the stack walking code (`for_scope_method`) from the code that checks for a reference to the arena being closed (`is_accessing_session`), and added logging code to the former. That also required changing vframe code to accept an `ouputStream*` rather than always printing to `tty`.
>> - Added a new test (`TestConcurrentClose`), that tries to close many shared arenas at the same time, in order to stress that use case.
>> - Added a new benchmark (`ConcurrentClose`), that stresses the cases where many threads are accessing and closing shared arenas.
>>
>> I've done several benchmark runs with different amounts of threads. The confined case stays much more consistent, while the shared cases balloons up in time spent quickly when there are more than 4 threads:
>>
>>
>> Benchmark Threads Mode Cnt Score Error Units
>> ConcurrentClose.sharedAccess 32 avgt 10 9017.397 ± 202.870 us/op
>> ConcurrentClose.sharedAccess 24 avgt 10 5178.214 ± 164.922 us/op
>> ConcurrentClose.sharedAccess 16 avgt 10 2224.420 ± 165.754 us/op
>> ConcurrentClose.sharedAccess 8 avgt 10 593.828 ± 8.321 us/op
>> ConcurrentClose.sharedAccess 7 avgt 10 470.700 ± 22.511 us/op
>> ConcurrentClose.sharedAccess 6 avgt 10 386.697 ± 59.170 us/op
>> ConcurrentClose.sharedAccess 5 avgt 10 291.157 ± 7.023 us/op
>> ConcurrentClose.sharedAccess 4 avgt 10 209.178 ± 5.802 us/op
>> ConcurrentClose.sharedAccess 1 avgt 10 ...
>
> Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision:
>
> improve benchmark
I've update the benchmark to run with 3 separate threads: 1 thread that is just creating and closing shared arenas in a loop, 1 that is accessing memory using the FFM API, and 1 that is accessing a `byte[]`.
Current:
Benchmark Mode Cnt Score Error Units
ConcurrentClose.sharedClose avgt 10 50.093 ± 6.200 us/op
ConcurrentClose.sharedClose:closing avgt 10 46.269 ± 0.786 us/op
ConcurrentClose.sharedClose:memorySegmentAccess avgt 10 98.072 ± 19.061 us/op
ConcurrentClose.sharedClose:otherAccess avgt 10 5.938 ± 0.058 us/op
I do see a pretty big difference on the memory segment accessing thread when I remove deoptimization altogether:
Benchmark Mode Cnt Score Error Units
ConcurrentClose.sharedClose avgt 10 22.664 ± 0.409 us/op
ConcurrentClose.sharedClose:closing avgt 10 45.351 ± 1.554 us/op
ConcurrentClose.sharedClose:memorySegmentAccess avgt 10 16.671 ± 0.251 us/op
ConcurrentClose.sharedClose:otherAccess avgt 10 5.969 ± 0.089 us/op
When I remove the `has_scoped_access()` check before the deopt, I expect the `otherAccess` thread to be affected, but the effect isn't nearly as big as with the FFM thread. I think this is likely due to the `otherAccess` benchmark being less sensitive to optimization (i.e. it already runs fairly fast in the interpreter). I also tried using `MethodHandles::arrayElementGetter` for the access, but the numbers I got were pretty much the same:
Benchmark Mode Cnt Score Error Units
ConcurrentClose.sharedClose avgt 10 52.745 ± 1.071 us/op
ConcurrentClose.sharedClose:closing avgt 10 46.670 ± 0.453 us/op
ConcurrentClose.sharedClose:memorySegmentAccess avgt 10 102.663 ± 3.430 us/op
ConcurrentClose.sharedClose:otherAccess avgt 10 8.901 ± 0.109 us/op
I think, to really test the effect of the `has_scoped_access` check, we need to look at a more realistic scenario.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20158#issuecomment-2228311368
More information about the core-libs-dev
mailing list