RFR: 8304245: Speed up CharacterData.of by avoiding bit shifting in the latin1 fast-path test [v2]
Eirik Bjorsnos
duke at openjdk.org
Wed Mar 15 14:31:09 UTC 2023
On Wed, 15 Mar 2023 13:50:44 GMT, Francesco Nigro <duke at openjdk.org> wrote:
>> I created a randomized version of `Characters.isDigit` which tests with code points picked at random such that all categories (Latin1, negative, different planes, unassigned) are equally probable.
>>
>> Baseline:
>>
>>
>> Benchmark (codePoint) Mode Cnt Score Error Units
>> Characters.isDigitRandom 1632 avgt 15 5.503 ± 0.371 ns/op
>>
>>
>> Current PR:
>>
>>
>> Benchmark (codePoint) Mode Cnt Score Error Units
>> Characters.isDigitRandom 1632 avgt 15 5.393 ± 0.336 ns/op
>>
>>
>> Using StringLatin1.canEncode:
>>
>>
>> Benchmark (codePoint) Mode Cnt Score Error Units
>> Characters.isDigitRandom 1632 avgt 15 5.377 ± 0.322 ns/op
>>
>>
>> Seems the PR still has a small improvement for this scenario. The StringLatin1.canEncode regression disappears.
>>
>> In the real world ASCII/Latin1 seems to dominate most data, so this scenario is perhaps not very realistic.
>>
>> I'm running this on a Mac, so cannot try `-prof perfnorm`.
>
> Many thanks for trying this; yes, I was indeed curious about the "StringLatin1.canEncode regression" case.
> I would still modify the benchmark to use inputs that are "semi-randomly" generated based on the mentioned strategies/biases. I know that will sadly make it memory bound, due to reading the inputs, but the size of such inputs can be a benchmark parameter, together with the bias (e.g. "latin", "mix", "non-latin").
> It will benefit future tests of this, although it could be provided as a separate PR.
> The StringLatin1.canEncode regression disappears.
I mixed things up so StringLatin1.canEncode was benchmarked without the updated code.
Here are updated benchmark results:
Baseline:

Benchmark                 (codePoint)  Mode  Cnt  Score    Error  Units
Characters.isDigitRandom         1632  avgt   15  5.437 ±  0.235  ns/op

PR:

Benchmark                 (codePoint)  Mode  Cnt  Score    Error  Units
Characters.isDigitRandom         1632  avgt   15  5.319 ±  0.341  ns/op

StringLatin1.canEncode:

Benchmark                 (codePoint)  Mode  Cnt  Score    Error  Units
Characters.isDigitRandom         1632  avgt   15  5.447 ±  0.304  ns/op

So it seems that using StringLatin1.canEncode may still cause a regression in the randomized-input case as well.
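
For reference, the randomized input described earlier in the thread (every category equally probable) could be generated roughly as follows. This is only a sketch, not the benchmark code used for the numbers above: the class `RandomCodePoints`, the `Category` enum, and the exact ranges are assumptions made for illustration.

```java
import java.util.Random;

public class RandomCodePoints {
    // Hypothetical categories, loosely mirroring the ones named in the thread.
    enum Category { LATIN1, NEGATIVE, BMP, SUPPLEMENTARY, UNASSIGNED }

    // Picks a category uniformly, then a code point within that category.
    static int next(Random rnd) {
        Category[] all = Category.values();
        return next(rnd, all[rnd.nextInt(all.length)]);
    }

    static int next(Random rnd, Category category) {
        switch (category) {
            case LATIN1:
                return rnd.nextInt(0x100);                        // 0x00..0xFF
            case NEGATIVE:
                return -1 - rnd.nextInt(Integer.MAX_VALUE);       // any value < 0
            case BMP:
                return 0x100 + rnd.nextInt(0x10000 - 0x100);      // rest of plane 0
            case SUPPLEMENTARY:
                return 0x10000 + rnd.nextInt(0x110000 - 0x10000); // planes 1..16
            default:
                break;
        }
        // UNASSIGNED: rejection sampling over the whole code point range.
        int cp;
        do {
            cp = rnd.nextInt(0x110000);
        } while (Character.getType(cp) != Character.UNASSIGNED);
        return cp;
    }
}
```

Note that the BMP and SUPPLEMENTARY buckets in this sketch may themselves return unassigned code points, so the categories overlap slightly.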
For this PR, I suggest we update StringLatin1.canEncode to be in sync with CharacterData.of, without one calling the other. If anyone wants to investigate the regression further, that can be done outside this PR.
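
To make "in sync" concrete, here is a small sketch of two equivalent ways to test whether a code point is in the Latin1 range, one with a shift and one with plain comparisons. The exact expressions used in the PR are not shown in this message, so the two forms below and the names `Latin1Check`, `viaShift` and `viaRange` are assumptions for illustration only.

```java
public class Latin1Check {
    // Shift-based form: true only when no bit above the low 8 bits is set.
    // The unsigned shift makes negative values fail the test as well.
    static boolean viaShift(int cp) {
        return cp >>> 8 == 0;
    }

    // Comparison-based form: the same range test without any shifting.
    static boolean viaRange(int cp) {
        return cp >= 0 && cp <= 0xFF;
    }

    public static void main(String[] args) {
        int[] samples = {
            Integer.MIN_VALUE, -1, 0, 'a', 0xFF, 0x100, 0xFFFF, 0x10FFFF, Integer.MAX_VALUE
        };
        for (int cp : samples) {
            if (viaShift(cp) != viaRange(cp)) {
                throw new AssertionError("forms disagree for " + cp);
            }
        }
        System.out.println("both forms agree on all samples");
    }
}
```

Keeping the same expression in both methods, rather than having one call the other, means any difference in the benchmarks cannot come from an extra call that may or may not be inlined.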
I have independently verified that StringLatin1.canEncode sees performance improvements using the StringIndexOf benchmark.
-------------
PR: https://git.openjdk.org/jdk/pull/13040