<i18n dev> RFR: 8248655: Support supplementary characters in String case insensitive operations

Joe Wang huizhe.wang at oracle.com
Tue Jul 21 17:05:25 UTC 2020



On 7/20/2020 8:58 PM, naoto.sato at oracle.com wrote:
>> The short-cut worked well. There's maybe a further optimization we 
>> could do to rid us of the performance concern (or the need to run 
>> additional performance tests). Consider the cases where strings in 
>> comparison don't contain any sup characters, if we make the 
>> toLower/UpperCase() block a method and call it before and after the 
>> surrogate-check block, the routine would be effectively the same as 
>> prior to the introduction of the surrogate-check block, and regular 
>> comparisons would suffer the surrogate-check only once (the last 
>> check). For strings that do contain sup characters then, the 
>> toLower/UpperCase() method would have been called twice, but then we 
>> don't care about the performance in that situation. You may call the 
>> existing codePointAt method too when an extra getChar and performance 
>> is not a concern (but that's your call).
>
> Can you please elaborate this more? What's "the last check" here?

What I meant was that we could extend the 'short-cut' from the 
case-sensitive check to the case-insensitive one. That is, in addition 
to line 337, also do the case-insensitive check of lines 353 - 370 
before the surrogate-check block.

I guess it can be explained better with code. I added inline comments:

         for (int k1 = toffset, k2 = ooffset; k1 < tlast && k2 < olast; k1++, k2++) {
             int cp1 = (int)getChar(value, k1);
             int cp2 = (int)getChar(other, k2);

// does a case-insensitive check:

             if (checkEqual(cp1, cp2) == 0) {
                 continue;
             }

// this block will be run at most once for strings that don't contain any
// supplementary characters

             // Check for supplementary characters case
             cp1 = getSupplementaryCodePoint(value, cp1, k1, toffset, tlast);
             if ((cp1 & Integer.MIN_VALUE) != 0) {
                 k1++;
                 cp1 ^= Integer.MIN_VALUE;
             }
             cp2 = getSupplementaryCodePoint(other, cp2, k2, ooffset, olast);
             if ((cp2 & Integer.MIN_VALUE) != 0) {
                 k2++;
                 cp2 ^= Integer.MIN_VALUE;
             }


// this check will have been called twice for strings that contain
// supplementary characters, but only once more for strings that don't

             int diff = checkEqual(cp1, cp2);
             if (diff != 0) {
                 return diff;
             }
         }
         return tlen - olen;
     }

// the code block between lines 353 - 370 in webrev.04, except the last
// line (return 0):
     private static int checkEqual(int cp1, int cp2) {
         if (cp1 != cp2) {
             // try converting both characters to uppercase.
             // If the results match, then the comparison scan should
             // continue.
             cp1 = Character.toUpperCase(cp1);
             cp2 = Character.toUpperCase(cp2);
             if (cp1 != cp2) {
                 // Unfortunately, conversion to uppercase does not work
                 // properly for the Georgian alphabet, which has strange
                 // rules about case conversion.  So we need to make one
                 // last check before exiting.
                 cp1 = Character.toLowerCase(cp1);
                 cp2 = Character.toLowerCase(cp2);
                 if (cp1 != cp2) {
                     return cp1 - cp2;
                 }
             }
         }
         return 0;
     }
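
For anyone who wants to try the idea outside the JDK sources, below is a
minimal, self-contained sketch of the same two-step comparison. It is not
the webrev code. It works on java.lang.String through the public API, and
the class name CaseInsensitiveCompareSketch and the helpers
compareIgnoreCase and codePointAround are hypothetical names made up for
this illustration (codePointAround stands in for what
getSupplementaryCodePoint does, without the Integer.MIN_VALUE flag trick).
The assumption is that looking one char backward or forward is enough to
recover a supplementary code point when the fast path fails.

```java
// Illustrative sketch only; names and structure are assumptions, not the webrev.
class CaseInsensitiveCompareSketch {

    // Same logic as checkEqual above: 0 if equal ignoring case, else a difference.
    static int checkEqual(int cp1, int cp2) {
        if (cp1 != cp2) {
            cp1 = Character.toUpperCase(cp1);
            cp2 = Character.toUpperCase(cp2);
            if (cp1 != cp2) {
                // Extra lowercase pass, kept for alphabets such as Georgian.
                cp1 = Character.toLowerCase(cp1);
                cp2 = Character.toLowerCase(cp2);
                if (cp1 != cp2) {
                    return cp1 - cp2;
                }
            }
        }
        return 0;
    }

    // Hypothetical helper: returns the full code point that the char at
    // 'index' belongs to, looking forward from a high surrogate or backward
    // from a low surrogate. Unpaired surrogates are returned unchanged.
    static int codePointAround(String s, int index) {
        char c = s.charAt(index);
        if (Character.isHighSurrogate(c) && index + 1 < s.length()
                && Character.isLowSurrogate(s.charAt(index + 1))) {
            return Character.toCodePoint(c, s.charAt(index + 1));
        }
        if (Character.isLowSurrogate(c) && index > 0
                && Character.isHighSurrogate(s.charAt(index - 1))) {
            return Character.toCodePoint(s.charAt(index - 1), c);
        }
        return c;
    }

    static int compareIgnoreCase(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            // Fast path: char-by-char case-insensitive check, no surrogate work.
            if (checkEqual(a.charAt(i), b.charAt(j)) != 0) {
                // Slow path: only reached on a mismatch, so strings without
                // supplementary characters pay for it at most once.
                int cp1 = codePointAround(a, i);
                int cp2 = codePointAround(b, j);
                int diff = checkEqual(cp1, cp2);
                if (diff != 0) {
                    return diff;
                }
                // If we started on a high surrogate, skip its low surrogate too.
                if (Character.isSupplementaryCodePoint(cp1)
                        && Character.isHighSurrogate(a.charAt(i))) {
                    i++;
                }
                if (Character.isSupplementaryCodePoint(cp2)
                        && Character.isHighSurrogate(b.charAt(j))) {
                    j++;
                }
            }
            i++;
            j++;
        }
        // Whichever string has chars left over sorts after the other.
        return (a.length() - i) - (b.length() - j);
    }
}
```

With this sketch, comparing "\uD801\uDC00" (U+10400) against "\uD801\uDC28"
(U+10428) hits the fast path on the shared high surrogate, fails it on the
low surrogates, and then succeeds on the slow path once the full code
points are compared.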



>
> Naoto 


