RFR: 8072070: Improve interpreter stack banging

Xin Liu xliu at openjdk.java.net
Sat Feb 5 01:26:07 UTC 2022


On Thu, 27 Jan 2022 18:42:15 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

> This is an old issue; I submitted the first RFE about it back in 2015. This shows up every time I benchmark interpreter-only code. Most recently, it showed up in my work to get the `java.lang.invoke` infra to work reasonably fast when cold, which includes lots of interpreter paths.
> 
> The underlying problem is that template interpreters rebang the entire shadow zone on every method entry. This takes tens of instructions, blows out TLB caches by accessing tens of pages (on some implementations, I reckon, almost the entire L1 TLB cache!), etc. I think we can make it universally better for all template interpreters by introducing the safe limit / growth watermarks for thread stacks, so that we bang only when needed. It also drops the need for special-casing the `native_call`, because we might as well bang the entire shadow zone in the native case as well.
> 
> This patch makes a pilot change for x86, without touching other architectures. Other architectures can follow this example later. This is why the `native_call` argument persists, even though it is no longer used in the x86 case. There is also a new test group that I found useful when debugging on Windows; that group is going to go away before integration.
> 
> I tried to capture the current mechanics of stack banging in `stackOverflow.hpp`, hoping that this makes the change more obvious, and that arch-specific template interpreter code can just reference it without copy-pasting it around.
> 
> I think it is fairly complete, and so I would like to solicit more feedback and testing here.
> 
> Point runs of SPECjvm2008 with `-Xint` show huge improvements on half of the tests, without any regressions:
> 
> 
>  compiler.compiler: +77%
>  compiler.sunflow: +69%
>  compress: +166%
>  crypto.rsa: +15%
>  crypto.signverify: +70%
>  mpegaudio: +8%
>  serial: +50%
>  sunflow: +57%
>  xml.transform: +61%
>  xml.validation: +43%
> 
> 
> My new `java.lang.invoke` benchmarks improve a lot as well:
> 
> 
> Benchmark              Mode  Cnt    Score    Error  Units
> 
> # Mainline
> MHInvoke.methodHandle  avgt    5  799.671 ± 9.087  ns/op
> MHInvoke.plain         avgt    5  261.947 ± 1.421  ns/op
> VHGet.plain            avgt    5  231.372 ± 3.044  ns/op
> VHGet.varHandle        avgt    5  924.880 ± 6.026  ns/op
> 
> # This WIP
> MHInvoke.methodHandle  avgt    5  240.456 ± 3.931  ns/op
> MHInvoke.plain         avgt    5   70.851 ± 1.986  ns/op
> VHGet.plain            avgt    5   52.506 ± 3.768  ns/op
> VHGet.varHandle        avgt    5  335.785 ± 4.398  ns/op
> 
> 
> It also palpably improves startup on a small HelloWorld, _even when compilers are present_:
> 
> 
> $ perf stat -r 5000 build/baseline/bin/java -Xms128m -Xmx128m Hello > /dev/null
> 
>  Performance counter stats for 'build/baseline/bin/java -Xms128m -Xmx128m Hello' (5000 runs):
> 
>              22.06 msec task-clock                #    1.030 CPUs utilized            ( +-  0.04% )
>                 96      context-switches          #    4.353 K/sec                    ( +-  0.07% )
>                  7      cpu-migrations            #  333.181 /sec                     ( +-  0.32% )
>              2,437      page-faults               #  110.469 K/sec                    ( +-  0.00% )
>         78,763,038      cycles                    #    3.571 GHz                      ( +-  0.05% )  (77.30%)
>          2,107,182      stalled-cycles-frontend   #    2.68% frontend cycles idle     ( +-  0.41% )  (77.40%)
>          2,235,371      stalled-cycles-backend    #    2.84% backend cycles idle      ( +-  1.05% )  (71.39%)
>         67,296,528      instructions              #    0.85  insn per cycle         
>                                                   #    0.03  stalled cycles per insn  ( +-  0.03% )  (89.79%)
>         12,483,022      branches                  #  565.911 M/sec                    ( +-  0.01% )  (99.73%)
>            384,412      branch-misses             #    3.08% of all branches          ( +-  0.07% )  (85.91%)
> 
>          0.0214224 +- 0.0000875 seconds time elapsed  ( +-  0.41% )
> 
> $ perf stat -r 5000 build/interp-bang/bin/java -Xms128m -Xmx128m Hello > /dev/null
> 
>  Performance counter stats for 'build/interp-bang/bin/java -Xms128m -Xmx128m Hello' (5000 runs):
> 
>              21.78 msec task-clock                #    1.031 CPUs utilized            ( +-  0.05% )
>                 98      context-switches          #    4.519 K/sec                    ( +-  0.07% )
>                  7      cpu-migrations            #  339.292 /sec                     ( +-  0.31% )
>              2,434      page-faults               #  111.755 K/sec                    ( +-  0.00% )
>         77,746,317      cycles                    #    3.569 GHz                      ( +-  0.05% )  (76.94%)
>          2,143,121      stalled-cycles-frontend   #    2.76% frontend cycles idle     ( +-  0.45% )  (76.03%)
>          2,059,440      stalled-cycles-backend    #    2.65% backend cycles idle      ( +-  1.11% )  (71.82%)
>         66,742,892      instructions              #    0.86  insn per cycle         
>                                                   #    0.03  stalled cycles per insn  ( +-  0.03% )  (91.40%)
>         12,494,797      branches                  #  573.634 M/sec                    ( +-  0.01% )  (99.80%)
>            386,145      branch-misses             #    3.09% of all branches          ( +-  0.08% )  (85.56%)
> 
>          0.0211278 +- 0.0000877 seconds time elapsed  ( +-  0.42% )
> 
> 
> Additional testing:
>  - [x] Linux x86_64 fastdebug, `tier1`
>  - [x] Linux x86_64 fastdebug, `tier2`
>  - [ ] Linux x86_64 fastdebug, `tier3`
>  - [x] Linux x86_32 fastdebug, `tier1`
>  - [x] Linux x86_32 fastdebug, `tier2`
>  - [x] Linux x86_32 fastdebug, `tier3`

src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp line 715:

> 713: }
> 714: 
> 715: void TemplateInterpreterGenerator::bang_stack_shadow_pages(bool native_call) {

The watermark algorithm should also work on other architectures such as aarch64, right?
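For context, my mental model of the watermark scheme is roughly the sketch below. This is plain C++ with made-up names (`ThreadStackState`, `bang_page`) and is only an illustration of the idea, not the actual patch or its field names:

#include <cstddef>
#include <cstdint>

// Illustrative per-thread state only; the real fields live in
// stackOverflow.hpp in the patch and may differ.
struct ThreadStackState {
  uintptr_t shadow_zone_safe_limit;       // at or below this SP, always re-bang
  uintptr_t shadow_zone_growth_watermark; // lowest SP for which banging was done
};

// Touch one page below SP so the OS commits it (the stack grows downwards).
static void bang_page(uintptr_t sp, size_t offset) {
  *reinterpret_cast<volatile char*>(sp - offset) = 0;
}

// Conceptual method-entry check: skip the banging loop when the current SP
// is still above the watermark, i.e. this region was already banged before.
void bang_stack_shadow_pages(ThreadStackState& t, uintptr_t sp,
                             size_t shadow_zone_size, size_t page_size) {
  if (sp > t.shadow_zone_growth_watermark) {
    return;                               // fast path: no pages touched
  }
  for (size_t off = page_size; off <= shadow_zone_size; off += page_size) {
    bang_page(sp, off);                   // slow path: bang the whole shadow zone
  }
  if (sp > t.shadow_zone_safe_limit) {
    t.shadow_zone_growth_watermark = sp;  // remember the new low-water mark
  }
}

If that reading is right, the fast path is just a compare of SP against a per-thread field plus a branch, which seems straightforward to port to aarch64 and the other ports.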

-------------

PR: https://git.openjdk.java.net/jdk/pull/7247

