<i18n dev> RFR: 8304245: Speed up CharacterData.of by avoiding bit shifting in the latin1 fast-path test [v2]

Wed Mar 15 14:31:09 UTC 2023

On Wed, 15 Mar 2023 13:50:44 GMT, Francesco Nigro <duke at openjdk.org> wrote:

>> I created a randomized version of `Characters.isDigit` which tests with code points picked at random such that any category (Latin1, negative, different planes, unassiged) are equally probable.
>> 
>> Baseline:
>> 
>> 
>> Benchmark                 (codePoint)  Mode  Cnt  Score   Error  Units
>> Characters.isDigitRandom         1632  avgt   15  5.503 ± 0.371  ns/op
>> 
>> 
>> Current PR:
>> 
>> 
>> Benchmark                 (codePoint)  Mode  Cnt  Score   Error  Units
>> Characters.isDigitRandom         1632  avgt   15  5.393 ± 0.336  ns/op
>> 
>> 
>> Using StringLatin1.canEncode:
>> 
>> 
>> Benchmark                 (codePoint)  Mode  Cnt  Score   Error  Units
>> Characters.isDigitRandom         1632  avgt   15  5.377 ± 0.322  ns/op
>> 
>> 
>> Seems the PR still has a small improvement for this scenario. The StringLatin1.canEncode regression disappears.
>> 
>> In the real world ASCII/Latin1 seems to dominate most data, so this scenario is perhaps not very realistic.
>> 
>> I'm running this on a Mac, so cannot try `-prof perfnorm`.
>
> Many thanks to have tried, yep, I was curious indeed re the "StringLatin1.canEncode regression" case.
> I would still modify the benchmark to use inputs (I know that will make it memory bound sadly, due to reading inputs - but the size of such inputs can be a benchmark parameter, together with the bias eg "latin","mix", "non-latin") "semi-randomly" generated based on the mentioned strategies/biases. 
> It will benefit future tests on this, although could be provided as a separate PR.

> The StringLatin1.canEncode regression disappears.

I mixed things up so StringLatin1.canEncode was benchmarked without the updated code.

Here are updated benchmark results:

Baseline:

Benchmark                 (codePoint)  Mode  Cnt  Score   Error  Units
Characters.isDigitRandom         1632  avgt   15  5.437 ± 0.235  ns/op

PR:

Benchmark                 (codePoint)  Mode  Cnt  Score   Error  Units
Characters.isDigitRandom         1632  avgt   15  5.319 ± 0.341  ns/op

StringLatin1.canEncode:

Benchmark                 (codePoint)  Mode  Cnt  Score   Error  Units
Characters.isDigitRandom         1632  avgt   15  5.447 ± 0.304  ns/op
``` 

So it seems using StringLatin1.canEncode still might have a regression also in the randomized input case.

For this PR, I suggest we update StringLatin1.canEncode to be in sync with CharacterData.of, without one calling the other. If anyone wants to investigate the regression further, than can be done outside this PR.

I have independently verified that StringLatin1.canEncode sees performance improvements using the StringIndexOf benchmark.

-------------

PR: https://git.openjdk.org/jdk/pull/13040