<i18n dev> RFR: 8248655: Support supplementary characters in String case insensitive operations

Tue Jul 21 21:01:35 UTC 2020

Thank you, Joe. I got it now. Will try out and benchmark.

Naoto

On 7/21/20 10:05 AM, Joe Wang wrote:
> 
> 
> On 7/20/2020 8:58 PM, naoto.sato at oracle.com wrote:
>>> The short-cut worked well. There may be a further optimization we 
>>> could do to rid us of the performance concern (or the need to run 
>>> additional performance tests). Consider the cases where the strings 
>>> being compared don't contain any supplementary characters. If we make 
>>> the toLower/UpperCase() block a method and call it before and after 
>>> the surrogate-check block, the routine would be effectively the same 
>>> as prior to the introduction of the surrogate-check block, and regular 
>>> comparisons would suffer the surrogate-check only once (the last 
>>> check). For strings that do contain supplementary characters, the 
>>> toLower/UpperCase() method would be called twice, but we don't care 
>>> about the performance in that situation. You may call the existing 
>>> codePointAt method too when an extra getChar and performance are not 
>>> a concern (but that's your call).
>>
>> Can you please elaborate on this? What's "the last check" here?
> 
> What I meant was that we could expand the 'short-cut' from case 
> sensitive to case insensitive, that is, in addition to line 337, do 
> the case-insensitive check at lines 353 - 370 as well.
> 
> I guess it can be explained better with code. I added inline comment:
> 
>          for (int k1 = toffset, k2 = ooffset;
>                  k1 < tlast && k2 < olast; k1++, k2++) {
>              int cp1 = (int)getChar(value, k1);
>              int cp2 = (int)getChar(other, k2);
> 
> // does a case-insensitive check:
> 
>              if (checkEqual(cp1, cp2) == 0) {
>                  continue;
>              }
> 
> // this block will be run once for strings that don't contain any
> // supplementary characters
> 
>              // Check for supplementary characters case
>              cp1 = getSupplementaryCodePoint(value, cp1, k1,
>                      toffset, tlast);
>              if ((cp1 & Integer.MIN_VALUE) != 0) {
>                  k1++;
>                  cp1 ^= Integer.MIN_VALUE;
>              }
>              cp2 = getSupplementaryCodePoint(other, cp2, k2,
>                      ooffset, olast);
>              if ((cp2 & Integer.MIN_VALUE) != 0) {
>                  k2++;
>                  cp2 ^= Integer.MIN_VALUE;
>              }
> 
> 
> // this check will have been called twice for strings that contain
> // supplementary characters, but only once more for strings that don't
> 
>              int diff = checkEqual(cp1, cp2);
>              if (diff != 0) {
>                  return diff;
>              }
>          }
>          return tlen - olen;
>      }
> 
> // the code block at lines 353 - 370 in webrev.04, except the last
> // line (return 0):
>      private static int checkEqual(int cp1, int cp2) {
>          if (cp1 != cp2) {
>              // try converting both characters to uppercase.
>              // If the results match, then the comparison scan should
>              // continue.
>              cp1 = Character.toUpperCase(cp1);
>              cp2 = Character.toUpperCase(cp2);
>              if (cp1 != cp2) {
>                  // Unfortunately, conversion to uppercase does not
>                  // work properly for the Georgian alphabet, which
>                  // has strange rules about case conversion.  So we
>                  // need to make one last check before exiting.
>                  cp1 = Character.toLowerCase(cp1);
>                  cp2 = Character.toLowerCase(cp2);
>                  if (cp1 != cp2) {
>                      return cp1 - cp2;
>                  }
>              }
>          }
>          return 0;
>      }
> 
> 
> 
>>
>> Naoto 
>
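
For anyone who wants to try the idea outside of java.lang.String, here is a
minimal standalone sketch of the same control flow using only public
String/Character APIs. It is not the webrev code: the class name
CaseInsensitiveSketch and the helper codePointIncluding are hypothetical
stand-ins (the latter for getSupplementaryCodePoint, whose body is not shown
in this thread), and it works on String rather than the internal byte[]. The
per-char case-insensitive check runs first; surrogates are decoded only on a
mismatch, and then the check runs once more on full code points.

```java
class CaseInsensitiveSketch {

    // Same logic as checkEqual in the quoted mail: returns 0 when the two
    // values are equal ignoring case, otherwise a non-zero difference.
    static int checkEqual(int cp1, int cp2) {
        if (cp1 != cp2) {
            cp1 = Character.toUpperCase(cp1);
            cp2 = Character.toUpperCase(cp2);
            if (cp1 != cp2) {
                cp1 = Character.toLowerCase(cp1);
                cp2 = Character.toLowerCase(cp2);
                if (cp1 != cp2) {
                    return cp1 - cp2;
                }
            }
        }
        return 0;
    }

    // Hypothetical helper: returns the code point that includes the char at
    // index, looking back one char when index sits on a low surrogate.
    static int codePointIncluding(String s, int index) {
        char c = s.charAt(index);
        if (Character.isLowSurrogate(c) && index > 0
                && Character.isHighSurrogate(s.charAt(index - 1))) {
            return Character.toCodePoint(s.charAt(index - 1), c);
        }
        // codePointAt pairs a high surrogate with a following low surrogate,
        // and otherwise returns the char itself.
        return s.codePointAt(index);
    }

    static int compareIgnoreCase(String a, String b) {
        for (int i = 0, j = 0; i < a.length() && j < b.length(); i++, j++) {
            char c1 = a.charAt(i);
            char c2 = b.charAt(j);

            // Fast path: per-char case-insensitive check, no surrogate work.
            if (checkEqual(c1, c2) == 0) {
                continue;
            }

            // Slow path: reached only on a mismatch.
            int cp1 = codePointIncluding(a, i);
            int cp2 = codePointIncluding(b, j);
            int diff = checkEqual(cp1, cp2);
            if (diff != 0) {
                return diff;
            }

            // Skip the low surrogate if we just consumed a pair from its
            // high surrogate.
            if (Character.isHighSurrogate(c1)
                    && Character.isSupplementaryCodePoint(cp1)) {
                i++;
            }
            if (Character.isHighSurrogate(c2)
                    && Character.isSupplementaryCodePoint(cp2)) {
                j++;
            }
        }
        return a.length() - b.length();
    }
}
```

With this sketch, compareIgnoreCase("\uD801\uDC00", "\uD801\uDC28") (Deseret
capital and small long I, U+10400 and U+10428) returns 0. The high surrogates
match on the fast path, so the mismatch shows up at the low surrogates, which
is why the helper has to look backwards as well as forwards.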