RFR: 8307351: (CmpI/L(AndI/L reg1 reg2)) on x86 can be optimized [v2]

Tue May 9 09:24:11 UTC 2023

> This patch aims to optimize a case where an And-Node followed by a Cmp-Node would not be converted into a single "test" instruction. The Architecture Description file currently only handles the cases where the right operand of the And-Node is a constant, but not the case where both operands are registers.
> Before this patch, an "and" followed by a "test" would be emitted, so removing the "and" means 2 fewer bytes have to be emitted.
> I've attached a JMH Benchmark to demonstrate the performance improvements. Here are the numbers from my Windows 11 machine:
> Before:
> 
> Benchmark                                                   Mode  Cnt   Score   Error  Units
> AndCmpTestInstruction.benchmarkOpaqueAndCmpEqualsInt        avgt    8  26,736 ± 0,131  ns/op
> AndCmpTestInstruction.benchmarkOpaqueAndCmpEqualsLong       avgt    8  24,305 ± 0,610  ns/op
> AndCmpTestInstruction.benchmarkStaticAndCmpEqualsInt        avgt    8  33,052 ± 0,056  ns/op
> AndCmpTestInstruction.benchmarkStaticLargeAndCmpEqualsLong  avgt    8  18,355 ± 0,030  ns/op
> AndCmpTestInstruction.benchmarkStaticSmallAndCmpEqualsLong  avgt    8  25,587 ± 0,107  ns/op
> 
> After:
> 
> Benchmark                                                   Mode  Cnt   Score   Error  Units  Improvement
> AndCmpTestInstruction.benchmarkOpaqueAndCmpEqualsInt        avgt    8  22,665 ± 0,170  ns/op  (~18%)
> AndCmpTestInstruction.benchmarkOpaqueAndCmpEqualsLong       avgt    8  18,880 ± 0,123  ns/op  (~29%)
> AndCmpTestInstruction.benchmarkStaticAndCmpEqualsInt        avgt    8  33,198 ± 0,126  ns/op  (unchanged)
> AndCmpTestInstruction.benchmarkStaticLargeAndCmpEqualsLong  avgt    8  18,427 ± 0,079  ns/op  (unchanged)
> AndCmpTestInstruction.benchmarkStaticSmallAndCmpEqualsLong  avgt    8  25,641 ± 0,168  ns/op  (unchanged)
> 
> As you can see, the cases with a small static mask have not improved, as they were already covered by another match rule. The test with the large static mask (benchmarkStaticLargeAndCmpEqualsLong) has a shorter instruction sequence, because the mask is moved to a register rather than being used directly in the instruction, but there is no measurable performance uplift here.
> I've tested my changes using the Tier1 jtreg Tests on Windows.
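
For readers skimming the description, here is a minimal Java sketch of the source-level shapes being discussed. It is not taken from the patch or the attached benchmark: the class name AndCmpSketch, the method names, and the mask value 0x40 are all hypothetical and chosen only for illustration. Whether the JIT actually emits a single "test" for any of these methods depends on the compiler and platform; the sketch only shows which expressions have two non-constant operands and which have a constant mask.

```java
// Illustrative sketch only; names and constants are assumptions, not from the patch.
public class AndCmpSketch {

    // Both operands are variables, so the And-Node has two register inputs.
    // This is the shape the description says was not matched before.
    static boolean intMaskIsZero(int a, int b) {
        return (a & b) == 0;
    }

    // Same shape for long operands (the AndL/CmpL variant).
    static boolean longMaskIsZero(long a, long b) {
        return (a & b) == 0L;
    }

    // Constant right operand: per the description, this shape was
    // already covered by an existing match rule.
    static boolean constMaskIsZero(int a) {
        return (a & 0x40) == 0;
    }

    public static void main(String[] args) {
        System.out.println(intMaskIsZero(0b1010, 0b0101)); // no common bits set
        System.out.println(longMaskIsZero(0xF0L, 0x10L));  // bit 4 is common
        System.out.println(constMaskIsZero(0x3F));         // bit 6 not set
    }
}
```

The methods are semantically trivial on purpose; the difference between them is only in which operand of the "and" is a compile-time constant.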

Tobias Hotz has updated the pull request incrementally with one additional commit since the last revision:

  Update benchmark copyright and remove invalid copypasted comment

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/13587/files
  - new: https://git.openjdk.org/jdk/pull/13587/files/b7cc690d..04a4118e

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=13587&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13587&range=00-01

  Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod
  Patch: https://git.openjdk.org/jdk/pull/13587.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13587/head:pull/13587

PR: https://git.openjdk.org/jdk/pull/13587