Possible optimization in StringLatin1.regionMatchesCI

Mon May 25 22:18:56 UTC 2020

Not a review, but:
Compare with the variant of this code in StringUTF16.
StringLatin1 only ever needs to support the first 256 chars in Unicode,
which can never change, unlike StringUTF16.
Do all the String tests still pass if you simplify the code?
Should CharacterDataLatin1 have a method to compare two characters
case-insensitively?
Be careful with LATIN SMALL LETTER SHARP S (U+00DF).
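
One way to probe the question before running the full String tests is a
brute-force comparison over all 256 Latin-1 values. The following is only a
sketch: the class and method names (Latin1CaseCheck, oldMatch, patchedMatch,
countDisagreements) are made up for illustration and are not JDK code. It
assumes the existing per-character logic is "equal, or equal after
toUpperCase, or equal after toLowerCase of the upper-cased values", as the
quoted diff suggests. main also prints the mappings for three code points
that deserve a look: sharp S (U+00DF), micro sign (U+00B5) and y with
diaeresis (U+00FF).

```java
public class Latin1CaseCheck {

    // Assumed shape of the current check: exact match, then upper case,
    // then lower case of the upper-cased values.
    static boolean oldMatch(char c1, char c2) {
        if (c1 == c2) {
            return true;
        }
        char u1 = Character.toUpperCase(c1);
        char u2 = Character.toUpperCase(c2);
        if (u1 == u2) {
            return true;
        }
        return Character.toLowerCase(u1) == Character.toLowerCase(u2);
    }

    // Proposed variant: the last step lower-cases the original characters.
    static boolean patchedMatch(char c1, char c2) {
        if (c1 == c2) {
            return true;
        }
        if (Character.toUpperCase(c1) == Character.toUpperCase(c2)) {
            return true;
        }
        return Character.toLowerCase(c1) == Character.toLowerCase(c2);
    }

    // Counts the Latin-1 pairs on which the two variants give different answers.
    static int countDisagreements() {
        int count = 0;
        for (int i = 0; i < 256; i++) {
            for (int j = 0; j < 256; j++) {
                if (oldMatch((char) i, (char) j) != patchedMatch((char) i, (char) j)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("disagreements: " + countDisagreements());
        // Show how the special cases map, for manual inspection.
        for (char c : new char[] {'\u00DF', '\u00B5', '\u00FF'}) {
            char upper = Character.toUpperCase(c);
            System.out.printf("U+%04X upper=U+%04X lower(upper)=U+%04X lower=U+%04X%n",
                    (int) c, (int) upper,
                    (int) Character.toLowerCase(upper),
                    (int) Character.toLowerCase(c));
        }
    }
}
```

A check like this only covers the per-character logic, so it complements
rather than replaces the existing String tests.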

On Mon, May 25, 2020 at 2:16 PM Christoph Dreis
<christoph.dreis at freenet.de> wrote:
>
> Hi,
>
> I've recently looked through the StringLatin1 code - specifically regionMatchesCI.
>
> I think I have an optimization, but would need someone with more domain knowledge to verify that I'm not missing anything.
>
> Currently, the code does a conversion to uppercase and if that doesn't match it does an additional comparison of the lowercase characters.
> What's confusing to me is that there are actually both upper- and lowercase checks needed, but that might be explained by the comment in the UTF-16 version about the Georgian alphabet.
>
> Assuming that the additional lowercase check is needed, I was wondering whether it has to be performed on the uppercased value. Wouldn't it be faster to apply it to the original character, to avoid potentially converting a lowercase character to uppercase and back?
>
> I think code explains what I'm suggesting better:
>
> --- a/src/java.base/share/classes/java/lang/StringLatin1.java   Wed May 13 16:18:16 2020 +0200
> +++ b/src/java.base/share/classes/java/lang/StringLatin1.java   Mon May 25 22:59:13 2020 +0200
> @@ -396,7 +396,7 @@
>              if (u1 == u2) {
>                  continue;
>              }
> -            if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
> +            if (Character.toLowerCase(c1) == Character.toLowerCase(c2)) {
>                  continue;
>              }
>              return false;
>
>
> And indeed the newer version seems to be faster if I use the following benchmark:
>
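> import java.util.concurrent.TimeUnit;
>
> import org.openjdk.jmh.annotations.Benchmark;
> import org.openjdk.jmh.annotations.BenchmarkMode;
> import org.openjdk.jmh.annotations.Mode;
> import org.openjdk.jmh.annotations.OutputTimeUnit;
> import org.openjdk.jmh.annotations.Scope;
> import org.openjdk.jmh.annotations.State;
>
> @BenchmarkMode(Mode.AverageTime)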
> @OutputTimeUnit(TimeUnit.NANOSECONDS)
> public class MyBenchmark {
>
>     @State(Scope.Benchmark)
>     public static class ThreadState {
>         private String test1 = "test";
>         private String test2 = "best";
>     }
>
>     @Benchmark
>     public boolean test(ThreadState threadState) {
>         return threadState.test1.equalsIgnoreCase(threadState.test2);
>     }
>
> }
>
> Benchmark                Mode  Cnt  Score    Error  Units
> MyBenchmark.testOld      avgt   10  8,843 ±  0,274  ns/op
> MyBenchmark.testPatched  avgt   10  7,067 ±  0,177  ns/op
>
> Does this make sense? Am I missing something here? I would appreciate it if someone could either explain the shortcomings of the solution above or, in case there aren't any, maybe sponsor it.
>
> Cheers,
> Christoph
>
>