RFR: 8330969: scalability issue with loaded JVMTI agent
Serguei Spitsyn
sspitsyn at openjdk.org
Wed Apr 24 16:11:50 UTC 2024
This fixes the following JVMTI scalability issue: a closed benchmark with millions of virtual threads shows a 4X overhead when a JVMTI agent is loaded. For instance, this is observable when an app is executed under control of the Oracle Studio `collect` utility.
The benchmark takes a little over 3 seconds without any JVMTI agent:
```
Total: in 3045 ms
```
The same run takes roughly 3.2X as long when executed with the `collect` utility:
```
Creating experiment database test.1.er (Process ID: 25262) ...
Picked up JAVA_TOOL_OPTIONS: -agentlib:collector
Total: in 9864 ms
```
With the fix in place, the JVMTI agent overhead is around 1.2X:
```
Creating experiment database test.1.er (Process ID: 26442) ...
Picked up JAVA_TOOL_OPTIONS: -agentlib:collector
Total: in 3765 ms
```
Most of the overhead is consumed by two functions:
- `JvmtiVTMSTransitionDisabler::start_VTMS_transition()`
- `JvmtiVTMSTransitionDisabler::finish_VTMS_transition()`
The Oracle Studio Performance Analyzer `er_print` utility shows the following performance data for these functions:
```
% er_print -viewmode expert -metrics ie.%totalcpu -csingle JvmtiVTMSTransitionDisabler::start_VTMS_transition test.1.er
Attr. Total Name
CPU
sec. %
=============== Callers
42.930 50.06 SharedRuntime::notify_jvmti_vthread_mount(oopDesc*, unsigned char, JavaThread*)
21.505 25.08 JvmtiVTMSTransitionDisabler::VTMS_vthread_end(_jobject*)
21.315 24.86 JvmtiVTMSTransitionDisabler::VTMS_vthread_unmount(_jobject*, bool)
=============== Stack Fragment
81.407 94.94 JvmtiVTMSTransitionDisabler::start_VTMS_transition(_jobject*, bool)
=============== Callees
4.083 4.76 java_lang_Thread::set_is_in_VTMS_transition(oopDesc*, bool)
0.140 0.16 __tls_get_addr
0.120 0.14 JNIHandles::resolve_external_guard(_jobject*)
% er_print -viewmode expert -metrics ie.%totalcpu -csingle JvmtiVTMSTransitionDisabler::finish_VTMS_transition test.1.er
Attr. Total Name
CPU
sec. %
=============== Callers
47.363 52.59 SharedRuntime::notify_jvmti_vthread_unmount(oopDesc*, unsigned char, JavaThread*)
21.355 23.71 JvmtiVTMSTransitionDisabler::VTMS_vthread_mount(_jobject*, bool)
21.345 23.70 JvmtiVTMSTransitionDisabler::VTMS_vthread_start(_jobject*)
=============== Stack Fragment
64.145 71.22 JvmtiVTMSTransitionDisabler::finish_VTMS_transition(_jobject*, bool)
=============== Callees
25.288 28.08 java_lang_Thread::set_is_in_VTMS_transition(oopDesc*, bool)
0.240 0.27 __tls_get_addr
0.200 0.22 JavaThread::set_is_in_VTMS_transition(bool)
0.190 0.21 JNIHandles::resolve_external_guard(_jobject*)
```
The main source of this overhead (~90% of it) is the atomic increment and decrement of the global counter `_VTMS_transition_count`:
- `Atomic::inc(&_VTMS_transition_count);`
- `Atomic::dec(&_VTMS_transition_count);`
The fix is to replace this global counter with mark bits `_VTMS_transition_mark` distributed across all `JavaThread` instances.
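For illustration, here is a minimal standalone sketch of the pattern (plain C++, not the actual HotSpot code; apart from `_VTMS_transition_count`, all names are made up for the example): each transitioning thread writes only its own mark, and the much rarer disabler scans all threads instead of reading one shared counter.
```
// Standalone sketch of the pattern only; NOT the HotSpot code.
#include <atomic>
#include <vector>

// Before: every start/finish of a transition hits the same cache line,
// which is what shows up as the hot Atomic::inc/Atomic::dec above.
std::atomic<long> _VTMS_transition_count{0};

void start_transition_global()  { _VTMS_transition_count.fetch_add(1); }
void finish_transition_global() { _VTMS_transition_count.fetch_sub(1); }

// After: each thread sets and clears only its own mark, so transitions
// on different carrier threads no longer contend with each other.
struct ThreadState {
  std::atomic<bool> transition_mark{false};
};

std::vector<ThreadState*> all_threads;  // registered once per thread

void start_transition(ThreadState* self)  { self->transition_mark.store(true);  }
void finish_transition(ThreadState* self) { self->transition_mark.store(false); }

// The (rare) disabler pays the cost instead: it walks the thread list
// and waits while any thread still has its mark set.
bool any_transition_in_progress() {
  for (ThreadState* t : all_threads) {
    if (t->transition_mark.load()) {
      return true;
    }
  }
  return false;
}
```
The contention moves from the very frequent mount/unmount path to the rare disable path, which now has to walk the thread list.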
If the `Atomic::inc`/`Atomic::dec` lines are commented out or replaced with the distributed thread-local marks, the main performance overhead is gone:
```
% er_print -viewmode expert -metrics ie.%totalcpu -csingle JvmtiVTMSTransitionDisabler::start_VTMS_transition test.2.er
Attr. Total Name
CPU
sec. %
============== Callers
1.801 64.29 SharedRuntime::notify_jvmti_vthread_mount(oopDesc*, unsigned char, JavaThread*)
0.580 20.71 JvmtiVTMSTransitionDisabler::VTMS_vthread_unmount(_jobject*, bool)
0.420 15.00 JvmtiVTMSTransitionDisabler::VTMS_vthread_end(_jobject*)
============== Stack Fragment
0.630 22.50 JvmtiVTMSTransitionDisabler::start_VTMS_transition(_jobject*, bool)
============== Callees
1.931 68.93 java_lang_Thread::set_is_in_VTMS_transition(oopDesc*, bool)
0.220 7.86 __tls_get_addr
0.020 0.71 JNIHandles::resolve_external_guard(_jobject*)
% er_print -viewmode expert -metrics ie.%totalcpu -csingle JvmtiVTMSTransitionDisabler::finish_VTMS_transition test.2.er
Attr. Total Name
CPU
sec. %
============== Callers
1.661 39.15 JvmtiVTMSTransitionDisabler::VTMS_vthread_mount(_jobject*, bool)
1.351 31.84 JvmtiVTMSTransitionDisabler::VTMS_vthread_start(_jobject*)
1.231 29.01 SharedRuntime::notify_jvmti_vthread_unmount(oopDesc*, unsigned char, JavaThread*)
============== Stack Fragment
0.500 11.79 JvmtiVTMSTransitionDisabler::finish_VTMS_transition(_jobject*, bool)
============== Callees
2.972 70.05 java_lang_Thread::set_is_in_VTMS_transition(oopDesc*, bool)
0.350 8.25 JavaThread::set_is_in_VTMS_transition(bool)
0.340 8.02 __tls_get_addr
0.080 1.89 JNIHandles::resolve_external_guard(_jobject*)
```
The rest of the overhead (~10% of the total) comes from calls to `java_lang_Thread::set_is_in_VTMS_transition()`. The plan is to address this in a separate fix, but it is expected to be a bit trickier.
Testing:
- Tested with mach5 tiers 1-6
-------------
Commit messages:
- 8330969: scalability issue with loaded JVMTI agent
Changes: https://git.openjdk.org/jdk/pull/18937/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18937&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8330969
Stats: 38 lines in 5 files changed: 14 ins; 11 del; 13 mod
Patch: https://git.openjdk.org/jdk/pull/18937.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/18937/head:pull/18937
PR: https://git.openjdk.org/jdk/pull/18937