RFR: 8072070: Improve interpreter stack banging

Sat Feb 5 09:21:09 UTC 2022

On Sat, 5 Feb 2022 08:13:34 GMT, Xin Liu <xliu at openjdk.org> wrote:

>> This is an old issue; I submitted the first RFE about this back in 2015. It shows up every time I benchmark interpreter-only code. Most recently, it showed up in my work to get the `java.lang.invoke` infra to work reasonably fast when cold, which involves lots of interpreter paths.
>> 
>> The underlying problem is that template interpreters rebang the entire shadow zone on every method entry. This takes tens of instructions, blows out TLB caches by accessing tens of pages (on some implementations, I reckon, almost the entire L1 TLB cache!), etc. I think we can make it universally better for all template interpreters by introducing the safe limit / growth watermarks for thread stacks, so that we bang only when needed. It also drops the need for special-casing the `native_call`, because we might as well bang the entire shadow zone in the native case as well.
>> 
>> This patch makes a pilot change for x86, without touching other architectures. Other architectures can follow this example later. This is why the `native_call` argument persists, even though it is not used in the x86 case anymore. There is also a new test group that I found useful when debugging on Windows; that group is going to go away before integration.
>> 
>> I tried to capture the current mechanics of stack banging in `stackOverflow.hpp`, hoping the change becomes more obvious, and so that arch-specific template interpreter codes could just reference it without copy-pasting it around.
>> 
>> I think it is fairly complete, and so would like to solicit more feedback and testing here.
>> 
>> Point runs on SPECjvm2008 with `-Xint` show huge improvements on half of the tests, without any regressions:
>> 
>> 
>>  compiler.compiler: +77%
>>  compiler.sunflow: +69%
>>  compress: +166%
>>  crypto.rsa: +15%
>>  crypto.signverify: +70%
>>  mpegaudio: +8%
>>  serial: +50%
>>  sunflow: +57%
>>  xml.transform: +61%
>>  xml.validation: +43%
>> 
>> 
>> My new `java.lang.invoke` benchmarks improve a lot as well:
>> 
>> 
>> Benchmark              Mode  Cnt    Score    Error  Units
>> 
>> # Mainline
>> MHInvoke.methodHandle  avgt    5  799.671 ± 9.087  ns/op
>> MHInvoke.plain         avgt    5  261.947 ± 1.421  ns/op
>> VHGet.plain            avgt    5  231.372 ± 3.044  ns/op
>> VHGet.varHandle        avgt    5  924.880 ± 6.026  ns/op
>> 
>> # This WIP
>> MHInvoke.methodHandle  avgt    5  240.456 ± 3.931  ns/op
>> MHInvoke.plain         avgt    5   70.851 ± 1.986  ns/op
>> VHGet.plain            avgt    5   52.506 ± 3.768  ns/op
>> VHGet.varHandle        avgt    5  335.785 ± 4.398  ns/op
>> 
>> 
>> It also palpably improves startup even on small HelloWorld, _even when compilers are present_:
>> 
>> 
>> $ perf stat -r 5000 build/baseline/bin/java -Xms128m -Xmx128m Hello > /dev/null
>> 
>>  Performance counter stats for 'build/baseline/bin/java -Xms128m -Xmx128m Hello' (5000 runs):
>> 
>>              22.06 msec task-clock                #    1.030 CPUs utilized            ( +-  0.04% )
>>                 96      context-switches          #    4.353 K/sec                    ( +-  0.07% )
>>                  7      cpu-migrations            #  333.181 /sec                     ( +-  0.32% )
>>              2,437      page-faults               #  110.469 K/sec                    ( +-  0.00% )
>>         78,763,038      cycles                    #    3.571 GHz                      ( +-  0.05% )  (77.30%)
>>          2,107,182      stalled-cycles-frontend   #    2.68% frontend cycles idle     ( +-  0.41% )  (77.40%)
>>          2,235,371      stalled-cycles-backend    #    2.84% backend cycles idle      ( +-  1.05% )  (71.39%)
>>         67,296,528      instructions              #    0.85  insn per cycle         
>>                                                   #    0.03  stalled cycles per insn  ( +-  0.03% )  (89.79%)
>>         12,483,022      branches                  #  565.911 M/sec                    ( +-  0.01% )  (99.73%)
>>            384,412      branch-misses             #    3.08% of all branches          ( +-  0.07% )  (85.91%)
>> 
>>          0.0214224 +- 0.0000875 seconds time elapsed  ( +-  0.41% )
>> 
>> $ perf stat -r 5000 build/interp-bang/bin/java -Xms128m -Xmx128m Hello > /dev/null
>> 
>>  Performance counter stats for 'build/interp-bang/bin/java -Xms128m -Xmx128m Hello' (5000 runs):
>> 
>>              21.78 msec task-clock                #    1.031 CPUs utilized            ( +-  0.05% )
>>                 98      context-switches          #    4.519 K/sec                    ( +-  0.07% )
>>                  7      cpu-migrations            #  339.292 /sec                     ( +-  0.31% )
>>              2,434      page-faults               #  111.755 K/sec                    ( +-  0.00% )
>>         77,746,317      cycles                    #    3.569 GHz                      ( +-  0.05% )  (76.94%)
>>          2,143,121      stalled-cycles-frontend   #    2.76% frontend cycles idle     ( +-  0.45% )  (76.03%)
>>          2,059,440      stalled-cycles-backend    #    2.65% backend cycles idle      ( +-  1.11% )  (71.82%)
>>         66,742,892      instructions              #    0.86  insn per cycle         
>>                                                   #    0.03  stalled cycles per insn  ( +-  0.03% )  (91.40%)
>>         12,494,797      branches                  #  573.634 M/sec                    ( +-  0.01% )  (99.80%)
>>            386,145      branch-misses             #    3.09% of all branches          ( +-  0.08% )  (85.56%)
>> 
>>          0.0211278 +- 0.0000877 seconds time elapsed  ( +-  0.42% )
>> 
>> 
>> Additional testing:
>>  - [x] Linux x86_64 fastdebug, `tier1`
>>  - [x] Linux x86_64 fastdebug, `tier2`
>>  - [x] Linux x86_64 fastdebug, `tier3`
>>  - [x] Linux x86_32 fastdebug, `tier1`
>>  - [x] Linux x86_32 fastdebug, `tier2`
>>  - [x] Linux x86_32 fastdebug, `tier3`
>
> src/hotspot/share/runtime/stackOverflow.hpp line 166:
> 
>> 164:   //     into adjacent thread stack, or even into other readable memory. This would potentially
>> 165:   //     pass the check by accident.
>> 166:   //  c) Allow for incremental stack growth by handling traps from not yet committed thread
> 
> I failed to understand why we have to do "incremental stack growth" here.  Why can't we just touch the last page?
> 
> __ bang_stack_with_offset(n_shadow_pages*page_size);
> 
> 
> The entire shadow zone is mapped. Touching it causes a commit, a page fault, or a SEGV. The first two events are transparent to the userspace process.
> 
> HotSpot will trap into the signal handler if `bang_stack_shadow_pages` does cross shadow_zone_safe_limit().  `rsp + n_shadow_pages * page_size` then falls into one of two cases:
> 1. red zone: the program is about to die anyway.
> 2. yellow or reserved zone: both are recoverable.
> 
> I feel it's not necessary to touch pages 1 to n_shadow_pages-1. The side effect is the same as touching the last page directly.
> 
> PS: I tried this [idea](https://github.com/navyxliu/jdk/runs/5075962312?check_suite_focus=true). Two failures were found on Windows. I guess the premise that the shadow zone is mapped is false on Windows.
> 
> compiler/interpreter/cr7116216/StackOverflow.java 
> compiler/uncommontrap/UncommonTrapStackBang.java

I read this blog post and I need to take back my comment.
https://pangin.pro/posts/stack-overflow-handling

Now I think the interpreter has to do linear probing to make sure HotSpot executes Java programs correctly. The reserved zone (reserve_zone) has a special meaning. Further, if rsp is very close to shadow_zone_safe_limit(), the so-called last page may land beyond the red zone.
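
To make the difference concrete, here is a minimal sketch of the two probing strategies on a toy page model. This is not HotSpot code: `Zone`, `classify`, `bang_last_page`, `bang_linear`, and the page counts are all hypothetical names and values picked for illustration, and "Outside" simply stands for whatever happens to be mapped past the end of the thread stack.

```cpp
#include <cassert>

// Toy model: pages are numbered from the low end of the thread stack,
// so page 0 is the very last page and the stack grows toward page 0.
// All sizes below are illustrative assumptions, not HotSpot constants.
enum class Zone { Normal, Reserved, Yellow, Red, Outside };

constexpr int kRedPages      = 1;   // hypothetical red zone size
constexpr int kYellowPages   = 2;   // hypothetical yellow zone size
constexpr int kReservedPages = 1;   // hypothetical reserved zone size
constexpr int kShadowPages   = 20;  // hypothetical shadow zone size

// Which zone a given page belongs to.
Zone classify(int page) {
  if (page < 0) return Zone::Outside;  // past the end of this stack
  if (page < kRedPages) return Zone::Red;
  if (page < kRedPages + kYellowPages) return Zone::Yellow;
  if (page < kRedPages + kYellowPages + kReservedPages) return Zone::Reserved;
  return Zone::Normal;
}

// "Touch only the last page": probe the far end of the shadow zone.
Zone bang_last_page(int sp_page) {
  assert(classify(sp_page) == Zone::Normal);
  return classify(sp_page - kShadowPages);
}

// Linear probing: touch every shadow page, nearest first, and report
// the first non-normal page hit (that is where the trap would occur).
Zone bang_linear(int sp_page) {
  assert(classify(sp_page) == Zone::Normal);
  for (int i = 1; i <= kShadowPages; i++) {
    Zone z = classify(sp_page - i);
    if (z != Zone::Normal) return z;
  }
  return Zone::Normal;
}
```

In this model, with the stack pointer on page 22, `bang_last_page` lands in the yellow zone and skips the reserved zone entirely, while `bang_linear` traps in the reserved zone first. With the stack pointer on page 10, `bang_last_page` lands outside the stack altogether, which is the "pass the check by accident" case from the comment in `stackOverflow.hpp`, while `bang_linear` still traps in the reserved zone.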

-------------

PR: https://git.openjdk.java.net/jdk/pull/7247