RFR: 8072070: Improve interpreter stack banging [v4]
Aleksey Shipilev
shade at openjdk.java.net
Tue Feb 8 07:18:52 UTC 2022
> This is an old issue: I submitted the first RFE about it back in 2015. It shows up every time I benchmark interpreter-only code. Most recently, it showed up in my work to get the `java.lang.invoke` infra to work reasonably fast when cold, which involves lots of interpreter paths.
>
> The underlying problem is that template interpreters rebang the entire shadow zone on every method entry. This takes tens of instructions and blows out TLB caches by accessing tens of pages (on some implementations, I reckon, almost the entire L1 TLB cache!). I think we can make it universally better for all template interpreters by introducing safe limit / growth watermarks for thread stacks, so that we bang only when needed. It also drops the need for special-casing `native_call`, because we might as well bang the entire shadow zone in the native case too.
>
> This patch makes a pilot change for x86, without touching other architectures. Other architectures can follow this example later. This is why the `native_call` argument persists, even though it is no longer used in the x86 case. There is also a new test group that I found useful when debugging on Windows; that group is going to go away before integration.
>
> I tried to capture the current mechanics of stack banging in `stackOverflow.hpp`, hoping the change becomes more obvious, and so that arch-specific template interpreter code can just reference it instead of copy-pasting it around.
>
> I think it is fairly complete, and so would like to solicit more feedback and testing here.
>
> Point runs on SPECjvm2008 with `-Xint` show huge improvements on half of the tests, without any regressions:
>
>
> compiler.compiler: +77%
> compiler.sunflow: +69%
> compress: +166%
> crypto.rsa: +15%
> crypto.signverify: +70%
> mpegaudio: +8%
> serial: +50%
> sunflow: +57%
> xml.transform: +61%
> xml.validation: +43%
>
>
> My new `java.lang.invoke` benchmarks improve a lot as well, running roughly 2.8x to 4.4x faster:
>
>
> Benchmark              Mode  Cnt    Score   Error  Units
>
> # Mainline
> MHInvoke.methodHandle  avgt    5  799.671 ± 9.087  ns/op
> MHInvoke.plain         avgt    5  261.947 ± 1.421  ns/op
> VHGet.plain            avgt    5  231.372 ± 3.044  ns/op
> VHGet.varHandle        avgt    5  924.880 ± 6.026  ns/op
>
> # This WIP
> MHInvoke.methodHandle  avgt    5  240.456 ± 3.931  ns/op
> MHInvoke.plain         avgt    5   70.851 ± 1.986  ns/op
> VHGet.plain            avgt    5   52.506 ± 3.768  ns/op
> VHGet.varHandle        avgt    5  335.785 ± 4.398  ns/op
>
>
> It also palpably improves startup even on small HelloWorld, _even when compilers are present_:
>
>
> $ perf stat -r 5000 build/baseline/bin/java -Xms128m -Xmx128m Hello > /dev/null
>
> Performance counter stats for 'build/baseline/bin/java -Xms128m -Xmx128m Hello' (5000 runs):
>
> 22.06 msec task-clock # 1.030 CPUs utilized ( +- 0.04% )
> 96 context-switches # 4.353 K/sec ( +- 0.07% )
> 7 cpu-migrations # 333.181 /sec ( +- 0.32% )
> 2,437 page-faults # 110.469 K/sec ( +- 0.00% )
> 78,763,038 cycles # 3.571 GHz ( +- 0.05% ) (77.30%)
> 2,107,182 stalled-cycles-frontend # 2.68% frontend cycles idle ( +- 0.41% ) (77.40%)
> 2,235,371 stalled-cycles-backend # 2.84% backend cycles idle ( +- 1.05% ) (71.39%)
> 67,296,528 instructions # 0.85 insn per cycle
> # 0.03 stalled cycles per insn ( +- 0.03% ) (89.79%)
> 12,483,022 branches # 565.911 M/sec ( +- 0.01% ) (99.73%)
> 384,412 branch-misses # 3.08% of all branches ( +- 0.07% ) (85.91%)
>
> 0.0214224 +- 0.0000875 seconds time elapsed ( +- 0.41% )
>
> $ perf stat -r 5000 build/interp-bang/bin/java -Xms128m -Xmx128m Hello > /dev/null
>
> Performance counter stats for 'build/interp-bang/bin/java -Xms128m -Xmx128m Hello' (5000 runs):
>
> 21.78 msec task-clock # 1.031 CPUs utilized ( +- 0.05% )
> 98 context-switches # 4.519 K/sec ( +- 0.07% )
> 7 cpu-migrations # 339.292 /sec ( +- 0.31% )
> 2,434 page-faults # 111.755 K/sec ( +- 0.00% )
> 77,746,317 cycles # 3.569 GHz ( +- 0.05% ) (76.94%)
> 2,143,121 stalled-cycles-frontend # 2.76% frontend cycles idle ( +- 0.45% ) (76.03%)
> 2,059,440 stalled-cycles-backend # 2.65% backend cycles idle ( +- 1.11% ) (71.82%)
> 66,742,892 instructions # 0.86 insn per cycle
> # 0.03 stalled cycles per insn ( +- 0.03% ) (91.40%)
> 12,494,797 branches # 573.634 M/sec ( +- 0.01% ) (99.80%)
> 386,145 branch-misses # 3.09% of all branches ( +- 0.08% ) (85.56%)
>
> 0.0211278 +- 0.0000877 seconds time elapsed ( +- 0.42% )
>
>
> Additional testing:
> - [x] Linux x86_64 fastdebug, `tier1`
> - [x] Linux x86_64 fastdebug, `tier2`
> - [x] Linux x86_64 fastdebug, `tier3`
> - [x] Linux x86_32 fastdebug, `tier1`
> - [x] Linux x86_32 fastdebug, `tier2`
> - [x] Linux x86_32 fastdebug, `tier3`
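
The "safe limit / growth watermark" idea from the description above can be illustrated with a small toy model. This is only a sketch under assumptions, not the code from the patch: the real change emits assembly in the template interpreter, while here addresses are plain integers, the stack grows downward, guard zones are ignored, and every name (`ToyStack`, `EntryResult`, `on_method_entry`, `pages_touched`) is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical outcome of a simulated method entry.
enum class EntryResult { FastPath, Banged, Overflow };

// Toy model of one thread stack. The stack grows toward lower addresses.
struct ToyStack {
    // Lowest SP from which a full shadow-zone bang still stays inside the stack.
    std::uintptr_t safe_limit;
    // Lowest SP for which the shadow zone has already been banged.
    // Starts at the maximum value, meaning "nothing banged yet".
    std::uintptr_t growth_watermark = std::numeric_limits<std::uintptr_t>::max();
    std::size_t page_size;
    std::size_t shadow_pages;
    std::size_t pages_touched = 0;  // counts simulated page touches

    ToyStack(std::uintptr_t stack_end, std::size_t page, std::size_t pages)
        : safe_limit(stack_end + page * pages), page_size(page), shadow_pages(pages) {
        assert(page > 0 && pages > 0);
    }

    EntryResult on_method_entry(std::uintptr_t sp) {
        // Fast path: the shadow zone below this SP was banged on an earlier entry.
        if (sp >= growth_watermark) {
            return EntryResult::FastPath;
        }
        // Too deep: a full bang would run past the end of the stack.
        // A real VM would go on to raise a stack overflow here.
        if (sp < safe_limit) {
            return EntryResult::Overflow;
        }
        // Slow path: touch every page of the shadow zone once, then remember it.
        for (std::size_t i = 1; i <= shadow_pages; ++i) {
            ++pages_touched;  // stands in for a store to sp - i * page_size
        }
        growth_watermark = sp;
        return EntryResult::Banged;
    }
};
```

In this model, repeated entries at the same or a shallower stack depth cost one comparison, and pages are touched only when the stack grows past the recorded watermark. That is the effect the description attributes to banging "only when needed".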
Aleksey Shipilev has updated the pull request incrementally with three additional commits since the last revision:
- Indents
- Drop the test group definition
- Update copyrights
-------------
Changes:
- all: https://git.openjdk.java.net/jdk/pull/7247/files
- new: https://git.openjdk.java.net/jdk/pull/7247/files/2c710882..ffd560ab
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7247&range=03
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7247&range=02-03
Stats: 41 lines in 4 files changed: 0 ins; 28 del; 13 mod
Patch: https://git.openjdk.java.net/jdk/pull/7247.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/7247/head:pull/7247
PR: https://git.openjdk.java.net/jdk/pull/7247