RFR: 8072070: Improve interpreter stack banging
Xin Liu
xliu at openjdk.java.net
Sat Feb 5 09:21:09 UTC 2022
On Sat, 5 Feb 2022 08:13:34 GMT, Xin Liu <xliu at openjdk.org> wrote:
>> This is an old issue: I submitted the first RFE about this back in 2015. It shows up every time I benchmark interpreter-only code. Most recently, it showed up in my work to get the `java.lang.invoke` infra to work reasonably fast when cold, which includes lots of interpreter paths.
>>
>> The underlying problem is that template interpreters rebang the entire shadow zone on every method entry. This takes tens of instructions, blows out TLB caches by accessing tens of pages (on some implementations, I reckon, almost the entire L1 TLB cache!), etc. I think we can make it universally better for all template interpreters by introducing the safe limit / growth watermarks for thread stacks, so that we bang only when needed. It also drops the need for special-casing the `native_call`, because we might as well bang the entire shadow zone in the native case as well.
>>
>> This patch makes a pilot change for x86, without touching other architectures. Other architectures can follow this example later. This is why the `native_call` argument persists, even though it is not used in the x86 case anymore. There is also a new test group that I found useful when debugging on Windows; that group is going to go away before integration.
>>
>> I tried to capture the current mechanics of stack banging in `stackOverflow.hpp`, hoping the change becomes more obvious, and so that arch-specific template interpreter codes could just reference it without copy-pasting it around.
>>
>> I think it is fairly complete, and so would like to solicit more feedback and testing here.
>>
>> Point runs on SPECjvm2008 with `-Xint` show huge improvements on half of the tests, without any regressions:
>>
>>
>> compiler.compiler: +77%
>> compiler.sunflow: +69%
>> compress: +166%
>> crypto.rsa: +15%
>> crypto.signverify: +70%
>> mpegaudio: +8%
>> serial: +50%
>> sunflow: +57%
>> xml.transform: +61%
>> xml.validation: +43%
>>
>>
>> My new `java.lang.invoke` benchmarks improve a lot as well:
>>
>>
>> Benchmark Mode Cnt Score Error Units
>>
>> # Mainline
>> MHInvoke.methodHandle avgt 5 799.671 ± 9.087 ns/op
>> MHInvoke.plain avgt 5 261.947 ± 1.421 ns/op
>> VHGet.plain avgt 5 231.372 ± 3.044 ns/op
>> VHGet.varHandle avgt 5 924.880 ± 6.026 ns/op
>>
>> # This WIP
>> MHInvoke.methodHandle avgt 5 240.456 ± 3.931 ns/op
>> MHInvoke.plain avgt 5 70.851 ± 1.986 ns/op
>> VHGet.plain avgt 5 52.506 ± 3.768 ns/op
>> VHGet.varHandle avgt 5 335.785 ± 4.398 ns/op
>>
>>
>> It also palpably improves startup even on small HelloWorld, _even when compilers are present_:
>>
>>
>> $ perf stat -r 5000 build/baseline/bin/java -Xms128m -Xmx128m Hello > /dev/null
>>
>> Performance counter stats for 'build/baseline/bin/java -Xms128m -Xmx128m Hello' (5000 runs):
>>
>> 22.06 msec task-clock # 1.030 CPUs utilized ( +- 0.04% )
>> 96 context-switches # 4.353 K/sec ( +- 0.07% )
>> 7 cpu-migrations # 333.181 /sec ( +- 0.32% )
>> 2,437 page-faults # 110.469 K/sec ( +- 0.00% )
>> 78,763,038 cycles # 3.571 GHz ( +- 0.05% ) (77.30%)
>> 2,107,182 stalled-cycles-frontend # 2.68% frontend cycles idle ( +- 0.41% ) (77.40%)
>> 2,235,371 stalled-cycles-backend # 2.84% backend cycles idle ( +- 1.05% ) (71.39%)
>> 67,296,528 instructions # 0.85 insn per cycle
>> # 0.03 stalled cycles per insn ( +- 0.03% ) (89.79%)
>> 12,483,022 branches # 565.911 M/sec ( +- 0.01% ) (99.73%)
>> 384,412 branch-misses # 3.08% of all branches ( +- 0.07% ) (85.91%)
>>
>> 0.0214224 +- 0.0000875 seconds time elapsed ( +- 0.41% )
>>
>> $ perf stat -r 5000 build/interp-bang/bin/java -Xms128m -Xmx128m Hello > /dev/null
>>
>> Performance counter stats for 'build/interp-bang/bin/java -Xms128m -Xmx128m Hello' (5000 runs):
>>
>> 21.78 msec task-clock # 1.031 CPUs utilized ( +- 0.05% )
>> 98 context-switches # 4.519 K/sec ( +- 0.07% )
>> 7 cpu-migrations # 339.292 /sec ( +- 0.31% )
>> 2,434 page-faults # 111.755 K/sec ( +- 0.00% )
>> 77,746,317 cycles # 3.569 GHz ( +- 0.05% ) (76.94%)
>> 2,143,121 stalled-cycles-frontend # 2.76% frontend cycles idle ( +- 0.45% ) (76.03%)
>> 2,059,440 stalled-cycles-backend # 2.65% backend cycles idle ( +- 1.11% ) (71.82%)
>> 66,742,892 instructions # 0.86 insn per cycle
>> # 0.03 stalled cycles per insn ( +- 0.03% ) (91.40%)
>> 12,494,797 branches # 573.634 M/sec ( +- 0.01% ) (99.80%)
>> 386,145 branch-misses # 3.09% of all branches ( +- 0.08% ) (85.56%)
>>
>> 0.0211278 +- 0.0000877 seconds time elapsed ( +- 0.42% )
>>
>>
>> Additional testing:
>> - [x] Linux x86_64 fastdebug, `tier1`
>> - [x] Linux x86_64 fastdebug, `tier2`
>> - [x] Linux x86_64 fastdebug, `tier3`
>> - [x] Linux x86_32 fastdebug, `tier1`
>> - [x] Linux x86_32 fastdebug, `tier2`
>> - [x] Linux x86_32 fastdebug, `tier3`
>
> src/hotspot/share/runtime/stackOverflow.hpp line 166:
>
>> 164: // into adjacent thread stack, or even into other readable memory. This would potentially
>> 165: // pass the check by accident.
>> 166: // c) Allow for incremental stack growth by handling traps from not yet committed thread
>
> I fail to understand why we have to do "incremental stack growth" here. Why can't we just touch the last page?
>
> __ bang_stack_with_offset(n_shadow_pages*page_size);
>
>
> The entire shadow zone is mapped. Touching it causes a commit, a page fault, or a SEGV. The first two events are transparent to the userspace process.
>
> HotSpot will trap into the signal handler if `bang_stack_shadow_pages` does cross shadow_zone_safe_limit(). `rsp + n_shadow_pages * page_size` then falls into one of two cases:
> 1. red zone: the program is about to die anyway.
> 2. yellow or reserved zone: both are recoverable.
>
> I feel it's not necessary to touch pages 1 to n_shadow_pages-1. The side effect is the same as touching the last page directly.
>
> PS: I tried this [idea](https://github.com/navyxliu/jdk/runs/5075962312?check_suite_focus=true). It produced 2 failures on Windows. I guess the premise that the shadow zone is mapped is false on Windows.
>
> compiler/interpreter/cr7116216/StackOverflow.java
> compiler/uncommontrap/UncommonTrapStackBang.java
I read this blog post and I need to take back my comment:
https://pangin.pro/posts/stack-overflow-handling
Now I think the interpreter has to do linear probing to make sure HotSpot executes Java programs correctly. reserve_zone has a special meaning. Further, if rsp is very close to shadow_zone_safe_limit(), the so-called last page may land beyond the red zone.
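
To make the "last page" problem concrete, here is a toy model, not HotSpot code. It treats the stack as a run of pages indexed downward from the stack base, with usable pages, then guard pages (reserved, yellow, and red zones lumped together), then whatever memory happens to sit past the stack end. The names `ToyStack`, `bang_last_page_only`, and `bang_linear`, and all zone sizes used with them, are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Toy model only. Page index 0 is the first page below the stack base;
// a larger index means deeper into the stack.
enum class Page { Usable, Guard, Foreign };

struct ToyStack {
    std::size_t usable_pages;  // pages that frames may legitimately use
    std::size_t guard_pages;   // reserved + yellow + red zones, all protected

    Page classify(std::size_t index) const {
        if (index < usable_pages) return Page::Usable;
        if (index < usable_pages + guard_pages) return Page::Guard;
        // Past the end of this stack: an adjacent mapping that may be
        // readable, so touching it does not necessarily trap.
        return Page::Foreign;
    }
};

// Single probe: touch only the page `shadow_pages` below the current one.
// Returns true if that touch traps on a guard page.
bool bang_last_page_only(const ToyStack& s, std::size_t sp_page,
                         std::size_t shadow_pages) {
    return s.classify(sp_page + shadow_pages) == Page::Guard;
}

// Linear probing: touch each page of the shadow zone in order and stop at
// the first guard page. Returns true if any touch traps.
bool bang_linear(const ToyStack& s, std::size_t sp_page,
                 std::size_t shadow_pages) {
    for (std::size_t i = 1; i <= shadow_pages; ++i) {
        if (s.classify(sp_page + i) == Page::Guard) return true;
    }
    return false;
}
```

With 100 usable pages, 4 guard pages, and a 20-page shadow zone, a single probe from page 95 lands on page 115, past the guard pages, and passes by accident, while linear probing traps at page 100. The shadow zone being wider than the guard zones is what makes the jump possible in this model.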
-------------
PR: https://git.openjdk.java.net/jdk/pull/7247