<i18n dev> RFR: 8248655: Support supplementary characters in String case insensitive operations
naoto.sato at oracle.com
Tue Jul 21 21:01:35 UTC 2020
Thank you, Joe. I got it now. Will try out and benchmark.
Naoto
On 7/21/20 10:05 AM, Joe Wang wrote:
>
>
> On 7/20/2020 8:58 PM, naoto.sato at oracle.com wrote:
>>> The short-cut worked well. There's maybe a further optimization we
>>> could do to rid us of the performance concern (or the need to run
>>> additional performance tests). Consider the cases where strings in
>>> comparison don't contain any sup characters, if we make the
>>> toLower/UpperCase() block a method and call it before and after the
>>> surrogate-check block, the routine would be effectively the same as
>>> prior to the introduction of the surrogate-check block, and regular
>>> comparisons would suffer the surrogate-check only once (the last
>>> check). For strings that do contain sup characters then, the
>>> toLower/UpperCase() method would have been called twice, but then we
>>> don't care about the performance in that situation. You may call the
>>> existing codePointAt method too when an extra getChar and performance
>>> is not a concern (but that's your call).
>>
>> Can you please elaborate on this? What's "the last check" here?
>
> What I meant was that we could expand the 'short-cut' from
> case-sensitive to case-insensitive; that is, in addition to line 337,
> do the case-insensitive check at lines 353 - 370 as well.
>
> I guess it can be explained better with code. I added inline comments:
>
> for (int k1 = toffset, k2 = ooffset; k1 < tlast && k2 < olast; k1++, k2++) {
>     int cp1 = (int)getChar(value, k1);
>     int cp2 = (int)getChar(other, k2);
>
>     // does a case-insensitive check:
>
>     if (checkEqual(cp1, cp2) == 0) {
>         continue;
>     }
>
>     // this block will be run once for strings that don't contain any
>     // supplementary characters
>
>     // Check for supplementary characters case
>     cp1 = getSupplementaryCodePoint(value, cp1, k1, toffset, tlast);
>     if ((cp1 & Integer.MIN_VALUE) != 0) {
>         k1++;
>         cp1 ^= Integer.MIN_VALUE;
>     }
>     cp2 = getSupplementaryCodePoint(other, cp2, k2, ooffset, olast);
>     if ((cp2 & Integer.MIN_VALUE) != 0) {
>         k2++;
>         cp2 ^= Integer.MIN_VALUE;
>     }
>
>     // this check will have been called twice for strings that contain
>     // supplementary characters, but only once more for strings that don't
>
>     int diff = checkEqual(cp1, cp2);
>     if (diff != 0) {
>         return diff;
>     }
> }
> return tlen - olen;
> }
>
> // the code block between lines 353 - 370 in webrev.04, except the last
> // line (return 0):
> private static int checkEqual(int cp1, int cp2) {
>     if (cp1 != cp2) {
>         // try converting both characters to uppercase.
>         // If the results match, then the comparison scan should
>         // continue.
>         cp1 = Character.toUpperCase(cp1);
>         cp2 = Character.toUpperCase(cp2);
>         if (cp1 != cp2) {
>             // Unfortunately, conversion to uppercase does not work properly
>             // for the Georgian alphabet, which has strange rules about case
>             // conversion. So we need to make one last check before
>             // exiting.
>             cp1 = Character.toLowerCase(cp1);
>             cp2 = Character.toLowerCase(cp2);
>             if (cp1 != cp2) {
>                 return cp1 - cp2;
>             }
>         }
>     }
>     return 0;
> }
>
>
>
>>
>> Naoto
>
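For anyone who wants to try the idea outside the JDK sources, here is a minimal standalone sketch of the two-stage comparison discussed above. It is not the webrev code: it works on String rather than the internal byte arrays, and the class name, compareIgnoreCase and startOfCodePoint are hypothetical names made up for illustration. startOfCodePoint stands in for the look-back that getSupplementaryCodePoint performs in the webrev, as I understand it. Only checkEqual follows the quoted code.

```java
// Standalone sketch of the "check chars first, decode surrogates only on
// mismatch" idea. All names except checkEqual are hypothetical.
final class CaseInsensitiveSketch {

    // Returns 0 if the two code points match ignoring case,
    // otherwise the difference of their lower-cased forms.
    static int checkEqual(int cp1, int cp2) {
        if (cp1 != cp2) {
            cp1 = Character.toUpperCase(cp1);
            cp2 = Character.toUpperCase(cp2);
            if (cp1 != cp2) {
                // Second attempt via lowercase, as in the quoted code.
                cp1 = Character.toLowerCase(cp1);
                cp2 = Character.toLowerCase(cp2);
                if (cp1 != cp2) {
                    return cp1 - cp2;
                }
            }
        }
        return 0;
    }

    // If index points at the low half of a surrogate pair, step back to
    // the high half so that codePointAt sees the whole pair.
    static int startOfCodePoint(String s, int index) {
        if (index > 0
                && Character.isLowSurrogate(s.charAt(index))
                && Character.isHighSurrogate(s.charAt(index - 1))) {
            return index - 1;
        }
        return index;
    }

    static int compareIgnoreCase(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            // Fast path: compare UTF-16 code units directly.
            if (checkEqual(a.charAt(i), b.charAt(j)) == 0) {
                i++;
                j++;
                continue;
            }
            // Slow path: reached only on a mismatch. Decode full code
            // points (possibly supplementary) and compare again.
            int s1 = startOfCodePoint(a, i);
            int s2 = startOfCodePoint(b, j);
            int cp1 = a.codePointAt(s1);
            int cp2 = b.codePointAt(s2);
            int diff = checkEqual(cp1, cp2);
            if (diff != 0) {
                return diff;
            }
            i = s1 + Character.charCount(cp1);
            j = s2 + Character.charCount(cp2);
        }
        // One string is exhausted; the longer remainder sorts last.
        return (a.length() - i) - (b.length() - j);
    }
}
```

With this sketch, a pair such as U+10400 and U+10428 (Deseret capital and small LONG I) first matches on the shared high surrogate, mismatches on the low surrogate, and is then resolved as equal by the code point check, while strings without supplementary characters reach the slow path only at their first real difference.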