RFR: 8280124: Reduce branches decoding latin-1 chars from UTF-8 encoded bytes

Tue Jan 18 21:28:29 UTC 2022

It’s kind of sad that you have to do this by hand.
The JVM should recognize that those back-to-back
if statements (in the bytecode) can be combined
into a non-branching O(1) test.

One item on my (very long) wish list for the JVM
is giving more comprehensive attention to the task
of testing membership in small finite sets.  This
one asks whether a given byte is one of two values,
where they happen to be consecutive and also have
almost the same bit pattern.  A more general version
of the problem would be any switch of this form:

switch (x) {
case 2: case 3: case 5: case 7: case 16:
default: return false;
}

As you can see, this is again a test for membership
in a small, statically determined set.  As long as
the static set is “small enough” (a flexible definition)
there is always a way that is “very fast” (flexible again)
to detect set membership, and in particular faster than
a branch tree (unless you know which branch is very likely
to be taken).  The “common bad” case is having to generate
a small perfect hash function, based on a multiply.  The
best case is an ad hoc detection of an arithmetic or
bitwise pattern, as you have done here by hand.

If the JVM had such a set-test generator, it could (a) turn
branch nests (from switches or from random user code) into
O(1) set tests, and (b) also supply a method handle factory
for them, which would be a very useful way to translate
indy-mediated discrimination trees that underly
pattern-matching switches.  Also it could translate
character class tests in the regex package.  (I started
thinking about this long ago when staring at hand-written
switches in Java code that implement character classes.)

— John

On 18 Jan 2022, at 2:44, Claes Redestad wrote:

> On Tue, 18 Jan 2022 10:08:35 GMT, Claes Redestad <redestad at openjdk.org> wrote:
>
>> This resolves minor inefficiency in the fast-path for decoding latin-1 chars from UTF-8. I also took the opportunity to refactor the StringDecode microbenchmark to align with recent changes to the StringEncode micro.
>>
>> The inefficiency is that this test is quite branchy:
>>
>> `if ((b1 == (byte)0xc2 || b1 == (byte)0xc3) && ...`
>>
>> Since the two constant bytes differ only on the lowest bit this can be transformed to this, saving us a branch:
>>
>> `if ((b1 & 0xfe) == 0xc2 && ...`
>>
>> This provides a small speed-up on microbenchmarks where the input can be internally encoded as latin1:
>>
>>
>> Benchmark (charsetName) Mode Cnt Score Error Units
>> StringDecode.decodeLatin1LongStart UTF-8 avgt 50 2283.591 ± 12.332 ns/op
>>
>> StringDecode.decodeLatin1LongStart UTF-8 avgt 50 2165.984 ± 13.136 ns/op
>
> On a microbenchmark that zooms in on the logical predicate the speed-up is closer to 2x. This seems like a transformation a JIT could do automatically. gcc and clang doesn't do it, but icc seem to pull it off (as tested via godbolt.org). It's unclear if this is common enough to motivate such enhancement work, but it might be of academic interest to attempt it.
>
> -------------
>
> PR: https://git.openjdk.java.net/jdk/pull/7122