<i18n dev> RFR: 8248655: Support supplementary characters in String case insensitive operations
Joe Wang
huizhe.wang at oracle.com
Tue Jul 21 17:05:25 UTC 2020
On 7/20/2020 8:58 PM, naoto.sato at oracle.com wrote:
>> The short-cut worked well. There's maybe a further optimization we
>> could do to rid us of the performance concern (or the need to run
>> additional performance tests). Consider the cases where strings in
>> comparison don't contain any sup characters, if we make the
>> toLower/UpperCase() block a method and call it before and after the
>> surrogate-check block, the routine would be effectively the same as
>> prior to the introduction of the surrogate-check block, and regular
>> comparisons would suffer the surrogate-check only once (the last
>> check). For strings that do contain sup characters then, the
>> toLower/UpperCase() method would have been called twice, but then we
>> don't care about the performance in that situation. You may call the
>> existing codePointAt method too when an extra getChar and performance
>> is not a concern (but that's your call).
>
> Can you please elaborate on this more? What's "the last check" here?
What I meant was that we could extend the 'short-cut' from a case-sensitive
check to a case-insensitive one. That is, in addition to the check at line 337,
also run the case-insensitive check from lines 353 - 370 before the
surrogate-check block.
I guess it can be explained better with code. I added inline comments:
    for (int k1 = toffset, k2 = ooffset; k1 < tlast && k2 < olast; k1++, k2++) {
        int cp1 = (int)getChar(value, k1);
        int cp2 = (int)getChar(other, k2);
        // does a case-insensitive check:
        if (checkEqual(cp1, cp2) == 0) {
            continue;
        }
        // this block will be run once for strings that don't contain any
        // supplementary characters

        // Check for supplementary characters case
        cp1 = getSupplementaryCodePoint(value, cp1, k1, toffset, tlast);
        if ((cp1 & Integer.MIN_VALUE) != 0) {
            k1++;
            cp1 ^= Integer.MIN_VALUE;
        }
        cp2 = getSupplementaryCodePoint(other, cp2, k2, ooffset, olast);
        if ((cp2 & Integer.MIN_VALUE) != 0) {
            k2++;
            cp2 ^= Integer.MIN_VALUE;
        }
        // this check will have been called twice for strings that contain
        // supplementary characters, but only once more for strings that don't
        int diff = checkEqual(cp1, cp2);
        if (diff != 0) {
            return diff;
        }
    }
    return tlen - olen;
}
// the code block between lines 353 - 370 in webrev.04, except the last
// line (return 0):
private static int checkEqual(int cp1, int cp2) {
    if (cp1 != cp2) {
        // try converting both characters to uppercase.
        // If the results match, then the comparison scan should
        // continue.
        cp1 = Character.toUpperCase(cp1);
        cp2 = Character.toUpperCase(cp2);
        if (cp1 != cp2) {
            // Unfortunately, conversion to uppercase does not work properly
            // for the Georgian alphabet, which has strange rules about case
            // conversion. So we need to make one last check before
            // exiting.
            cp1 = Character.toLowerCase(cp1);
            cp2 = Character.toLowerCase(cp2);
            if (cp1 != cp2) {
                return cp1 - cp2;
            }
        }
    }
    return 0;
}
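
For anyone who wants to try the idea outside of StringUTF16, here is a rough
standalone sketch that uses only public String/Character APIs. It is not the
webrev code: the class name, compareToCI and fullCodePoint are made up for
illustration, and fullCodePoint is only an assumed stand-in for
getSupplementaryCodePoint (it composes a surrogate pair from either half
instead of returning a flagged value). The shape is the same though: a
case-insensitive check on the raw chars first, and the surrogate handling only
when that check fails.

```java
final class SupplementaryCompareSketch {

    // Same logic as checkEqual above: 0 if the two code points match
    // ignoring case, otherwise their difference after case folding.
    static int checkEqual(int cp1, int cp2) {
        if (cp1 != cp2) {
            cp1 = Character.toUpperCase(cp1);
            cp2 = Character.toUpperCase(cp2);
            if (cp1 != cp2) {
                cp1 = Character.toLowerCase(cp1);
                cp2 = Character.toLowerCase(cp2);
                if (cp1 != cp2) {
                    return cp1 - cp2;
                }
            }
        }
        return 0;
    }

    // Hypothetical helper: returns the full code point that the char at
    // index k belongs to, whether k points at the high or the low surrogate.
    static int fullCodePoint(String s, int k) {
        char c = s.charAt(k);
        if (Character.isHighSurrogate(c)) {
            return s.codePointAt(k);
        }
        if (Character.isLowSurrogate(c)) {
            return s.codePointBefore(k + 1);
        }
        return c;
    }

    // Hypothetical whole-string variant of the loop above.
    static int compareToCI(String a, String b) {
        for (int k1 = 0, k2 = 0; k1 < a.length() && k2 < b.length(); k1++, k2++) {
            int cp1 = a.charAt(k1);
            int cp2 = b.charAt(k2);
            // fast path: strings without supplementary characters that
            // match ignoring case never get past this check
            if (checkEqual(cp1, cp2) == 0) {
                continue;
            }
            // slow path: compose surrogate pairs and check again
            cp1 = fullCodePoint(a, k1);
            cp2 = fullCodePoint(b, k2);
            int diff = checkEqual(cp1, cp2);
            if (diff != 0) {
                return diff;
            }
            // if we composed forward from a high surrogate, skip its low half
            if (Character.isHighSurrogate(a.charAt(k1))
                    && Character.isSupplementaryCodePoint(cp1)) {
                k1++;
            }
            if (Character.isHighSurrogate(b.charAt(k2))
                    && Character.isSupplementaryCodePoint(cp2)) {
                k2++;
            }
        }
        return a.length() - b.length();
    }

    public static void main(String[] args) {
        // U+10400 / U+10428: DESERET CAPITAL / SMALL LETTER LONG I
        String upper = new String(Character.toChars(0x10400));
        String lower = new String(Character.toChars(0x10428));
        System.out.println(compareToCI(upper, lower));
        System.out.println(compareToCI("abc", "ABC"));
    }
}
```

With the Deseret pair the raw chars differ only in the low surrogate, so the
first checkEqual fails, the pair is composed, and the second checkEqual
matches. Plain BMP strings such as "abc" vs "ABC" never leave the fast path.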
>
> Naoto