Possible optimization in StringLatin1.regionMatchesCI

Tue May 26 11:04:33 UTC 2020

Hi Martin,

> Not a review, but:
> Compare with the variant of this code in StringUTF16.
> StringLatin1 only ever needs to support the first 256 chars in Unicode

Does it really? That makes me wonder even more about the additional lowercase check.

> which can never change, unlike StringUTF16,

What do you mean by "can never change"?

> Do all the String tests still pass if you simplify the code?

You mean if I remove the additional lowercasing completely!?

--- a/src/java.base/share/classes/java/lang/StringLatin1.java   Tue May 26 09:25:23 2020 +0200
+++ b/src/java.base/share/classes/java/lang/StringLatin1.java   Tue May 26 13:01:14 2020 +0200
@@ -396,9 +396,6 @@
             if (u1 == u2) {
                 continue;
             }
-            if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
-                continue;
-            }
             return false;
         }
         return true;

Tier1 seems to pass still (apart from some tests that don't seem to like my German setup - also without the change)

> Should CharacterDataLatin1 have a method to compare two characters
> case-insensitively?

What do you mean by that? It needs to keep at least one check for the regionMatchesCI method.

> Be careful with Latin Small Letter sharp S
That seems to stay 223 regardless of uppercasing or lowercasing it.

I'm afraid your answer raised more question marks than it actually solved. ;-)

Cheers,
Christoph

>On Mon, May 25, 2020 at 2:16 PM Christoph Dreis
><christoph.dreis at freenet.de> wrote:
>>
>> Hi,
>>
>> I've recently looked through the StringLatin1 code - specifically regionMatchesCI.
>>
>> I think I have an optimization, but would need someone with more domain knowledge to verify if I'm missing nothing.
>>
>> Currently, the code does a conversion to uppercase and if that doesn't match it does an additional comparison of the lowercase characters.
>> What's confusing to me is that there are actually both upper- and lowercase checks needed, but that might be explained by the comment in the UTF-16 version about the Georgian alphabet.
>>
>> Assuming that the additional lowercase check is needed, I was wondering if this must be on the uppercase variant. Wouldn't it be faster on the character itself to avoid potentially converting a lowercase character to an uppercase character and back?
>>
>> I think code is actually better explaining what I'm suggesting:
>>
>> --- a/src/java.base/share/classes/java/lang/StringLatin1.java   Wed May 13 16:18:16 2020 +0200
>> +++ b/src/java.base/share/classes/java/lang/StringLatin1.java   Mon May 25 22:59:13 2020 +0200
>> @@ -396,7 +396,7 @@
>>              if (u1 == u2) {
>>                  continue;
>>              }
>> -            if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
>> +            if (Character.toLowerCase(c1) == Character.toLowerCase(c2)) {
>>                  continue;
>>              }
>>              return false;
>>
>>
>> And indeed the newer version seems to be faster if I use the following benchmark:
>>
>> @BenchmarkMode(Mode.AverageTime)
>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>> public class MyBenchmark {
>>
>>     @State(Scope.Benchmark)
>>     public static class ThreadState {
>>         private String test1 = "test";
>>         private String test2 = "best";
>>     }
>>
>>     @Benchmark
>>     public boolean test(ThreadState threadState) {
>>         return threadState.test1.equalsIgnoreCase(threadState.test2);
>>     }
>>
>> }
>>
>> Benchmark                                      Mode  Cnt   Score    Error   Units
>> MyBenchmark.testOld                  avgt   10   8,843 ±  0,274   ns/op
>> MyBenchmark.testPatched          avgt   10   7,067 ±  0,177   ns/op
>>
>> Does this make sense? Do I miss something here? I would appreciate if someone can either explain the shortcomings of the solution above or - in case there aren't any - can maybe sponsor it.
>>
>> Cheers,
>> Christoph
>>
>>