RFR: JDK-8032012, , String.toLowerCase/toUpperCase performance improvement
Paul Sandoz
paul.sandoz at oracle.com
Wed Jan 22 15:20:56 UTC 2014
On Jan 21, 2014, at 11:05 PM, Xueming Shen <xueming.shen at oracle.com> wrote:
> Hi Paul, thanks for reviewing the changeset, comment inlined below.
>
> On 01/20/2014 09:24 AM, Paul Sandoz wrote:
>>
>> Some quick comments.
>>
>> In String.toLowerCase:
>>
>> - it would be nice to get rid of the pseudo goto using the "scan" labelled block.
>>
>
> webrev has been updated to remove the pseudo goto by checking the "first" against
> "len" after the loop break.
>
Much for readable :-)
>> - there is no need for the "localeDependent" local variable.
>>
>
> I just tried to not throw away the result of this "localeDependent", which is still needed
> in the toXCaseEx() methods.
>
if (lang == "tr" || lang == "az" || lang == "lt") {
// local dependent
return toLowerCaseEx(result, firstUpper, locale, true);
}
// otherwise false is passed to subsequent calls to toLowerCaseEx
?
>> - you might be able to optimize by doing (could depend on the answer to the next point):
>>
>> int c = (int)value[i];
>> int lc = Character.toLowerCase(c);
>> if (.....) { result[i] = (char)lc; } else { return toLowerCaseEx(result, i, locale, localeDependent); }
>>
>> - Do you need to check ERROR for the result of toLowerCase?
>>
>> 2586 if (c == Character.ERROR ||
>>
>
> Yes, Character.toLowerCase() should never return ERROR (while the package private
> Character.toUpperCaseEx() will). In theory there is no need to check if the return
> value of Character.toUpperCase(int) > min_supplementary_code_point in our loop,
> because there is no bmp character returns a supplementary code point as its lower
> case. But since it's a data driven mapping table, there is no guarantee the unicode
> data table is not going to change in the "future", so I still keep the check. The webrev
> has been updated to use one single "if" inside the loop.
>
> I have a "single if" implementation for the toUpperCase() as well, though don't sure
> which one is better/faster :-)
>
OK, i suppose one could measure :-)
Not sure how much it is worth obsessing over but...
int c = (int)value[i];
if (c < '\u03A3') {
result[i] = (char)Character.toLowerCase(c); // Is that safe?
} else if (c < Character.MIN_HIGH_SURROGATE && c != 'u03A3' && (c = Character.toLowerCase(c)) < Character.MIN_SUPPLEMENTARY_CODE_POINT)) {
result[i] = (char)c;
} else {
return toLowerCaseEx(result, i, locale, localeDependent);
}
or:
int c = (int)value[i];
int lc = Character.toLowerCase(c); // is that safe?
if (c < '\u03A3') {
result[i] = (char)lc;
} else if (c < Character.MIN_HIGH_SURROGATE && c != 'u03A3' && lc < Character.MIN_SUPPLEMENTARY_CODE_POINT)) {
result[i] = (char)lc;
} else {
return toLowerCaseEx(result, i, locale, localeDependent);
}
or:
int c = (int)value[i];
int lc = Character.toLowerCase(c); // is that safe?
if (c < '\u03A3' || (c < Character.MIN_HIGH_SURROGATE && c != 'u03A3' && lc < Character.MIN_SUPPLEMENTARY_CODE_POINT))) {
result[i] = (char)lc;
} else {
return toLowerCaseEx(result, i, locale, localeDependent);
}
FWIW i personally find those solutions easier to read, if they are safe w.r.t. Character.toLowerCase and that annoying greek character.
Paul.
> http://cr.openjdk.java.net/~sherman/8032012/
>
> -Sherman
>
More information about the core-libs-dev
mailing list