Possible optimization in StringLatin1.regionMatchesCI
Christoph Dreis
christoph.dreis at freenet.de
Tue May 26 11:04:33 UTC 2020
Hi Martin,
> Not a review, but:
> Compare with the variant of this code in StringUTF16.
> StringLatin1 only ever needs to support the first 256 chars in Unicode
Does it really? That makes me wonder even more about the additional lowercase check.
> which can never change, unlike StringUTF16,
What do you mean by "can never change"?
> Do all the String tests still pass if you simplify the code?
You mean if I remove the additional lowercasing completely!?
--- a/src/java.base/share/classes/java/lang/StringLatin1.java Tue May 26 09:25:23 2020 +0200
+++ b/src/java.base/share/classes/java/lang/StringLatin1.java Tue May 26 13:01:14 2020 +0200
@@ -396,9 +396,6 @@
if (u1 == u2) {
continue;
}
- if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
- continue;
- }
return false;
}
return true;
Tier1 seems to pass still (apart from some tests that don't seem to like my German setup - also without the change)
> Should CharacterDataLatin1 have a method to compare two characters
> case-insensitively?
What do you mean by that? It needs to keep at least one check for the regionMatchesCI method.
> Be careful with Latin Small Letter sharp S
That seems to stay 223 regardless of uppercasing or lowercasing it.
I'm afraid your answer raised more question marks than it actually solved. ;-)
Cheers,
Christoph
>On Mon, May 25, 2020 at 2:16 PM Christoph Dreis
><christoph.dreis at freenet.de> wrote:
>>
>> Hi,
>>
>> I've recently looked through the StringLatin1 code - specifically regionMatchesCI.
>>
>> I think I have an optimization, but would need someone with more domain knowledge to verify if I'm missing nothing.
>>
>> Currently, the code does a conversion to uppercase and if that doesn't match it does an additional comparison of the lowercase characters.
>> What's confusing to me is that there are actually both upper- and lowercase checks needed, but that might be explained by the comment in the UTF-16 version about the Georgian alphabet.
>>
>> Assuming that the additional lowercase check is needed, I was wondering if this must be on the uppercase variant. Wouldn't it be faster on the character itself to avoid potentially converting a lowercase character to an uppercase character and back?
>>
>> I think code is actually better explaining what I'm suggesting:
>>
>> --- a/src/java.base/share/classes/java/lang/StringLatin1.java Wed May 13 16:18:16 2020 +0200
>> +++ b/src/java.base/share/classes/java/lang/StringLatin1.java Mon May 25 22:59:13 2020 +0200
>> @@ -396,7 +396,7 @@
>> if (u1 == u2) {
>> continue;
>> }
>> - if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
>> + if (Character.toLowerCase(c1) == Character.toLowerCase(c2)) {
>> continue;
>> }
>> return false;
>>
>>
>> And indeed the newer version seems to be faster if I use the following benchmark:
>>
>> @BenchmarkMode(Mode.AverageTime)
>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>> public class MyBenchmark {
>>
>> @State(Scope.Benchmark)
>> public static class ThreadState {
>> private String test1 = "test";
>> private String test2 = "best";
>> }
>>
>> @Benchmark
>> public boolean test(ThreadState threadState) {
>> return threadState.test1.equalsIgnoreCase(threadState.test2);
>> }
>>
>> }
>>
>> Benchmark Mode Cnt Score Error Units
>> MyBenchmark.testOld avgt 10 8,843 ± 0,274 ns/op
>> MyBenchmark.testPatched avgt 10 7,067 ± 0,177 ns/op
>>
>> Does this make sense? Do I miss something here? I would appreciate if someone can either explain the shortcomings of the solution above or - in case there aren't any - can maybe sponsor it.
>>
>> Cheers,
>> Christoph
>>
>>
More information about the core-libs-dev
mailing list