RFR: JDK-8032012, , String.toLowerCase/toUpperCase performance improvement

Xueming Shen xueming.shen at oracle.com
Wed Feb 5 17:58:51 UTC 2014


Hi,

Let's try to wrap it up, otherwise I may drop the ball somewhere :-)

On 01/22/2014 07:20 AM, Paul Sandoz wrote:
>
>
>>
> if (lang == "tr" || lang == "az" || lang == "lt") {
>    // local dependent
>    return toLowerCaseEx(result, firstUpper, locale, true);
> }
> // otherwise false is passed to subsequent calls to toLowerCaseEx
>
> ?
>

toLowerCaseEx will also be invoked later (in your another suggestion next), which
needs a "localeDependent".


>>> - you might be able to optimize by doing (could depend on the answer to the next point):
>>>
>>>   int c = (int)value[i];
>>>   int lc = Character.toLowerCase(c);
>>>   if (.....) { result[i] = (char)lc; } else { return toLowerCaseEx(result, i, locale, localeDependent); }
>>>
>>> - Do you need to check ERROR for the result of toLowerCase?
>>>
>>> 2586             if (c == Character.ERROR ||
>>>
>> Yes, Character.toLowerCase() should never return ERROR (while the package private
>> Character.toUpperCaseEx() will). In theory there is no need to check if the return
>> value of  Character.toUpperCase(int)>  min_supplementary_code_point in our loop,
>> because there is no bmp character returns a supplementary code point as its lower
>> case. But since it's a data driven mapping table, there is no guarantee the unicode
>> data table is not going to change in the "future", so I still keep the check. The webrev
>> has been updated to use one single "if" inside the loop.
>>
>> I have a "single if" implementation for the toUpperCase() as well, though don't sure
>> which one is better/faster :-)
>>
> OK, i suppose one could measure :-)
>
> Not sure how much it is worth obsessing over but...
>
>    int c = (int)value[i];
>    if (c<  '\u03A3') {
>      result[i] = (char)Character.toLowerCase(c); // Is that safe?
>    } else if (c<  Character.MIN_HIGH_SURROGATE&&  c != 'u03A3'&&  (c = Character.toLowerCase(c))<  Character.MIN_SUPPLEMENTARY_CODE_POINT)) {
>      result[i] = (char)c;
>    } else {
>      return toLowerCaseEx(result, i, locale, localeDependent);
>    }
>
> or:
>
>    int c = (int)value[i];
>    int lc = Character.toLowerCase(c); // is that safe?
>    if (c<  '\u03A3') {
>      result[i] = (char)lc;
>    } else if (c<  Character.MIN_HIGH_SURROGATE&&  c != 'u03A3'&&  lc<  Character.MIN_SUPPLEMENTARY_CODE_POINT)) {
>      result[i] = (char)lc;
>    } else {
>      return toLowerCaseEx(result, i, locale, localeDependent);
>    }
>
> or:
>
>    int c = (int)value[i];
>    int lc = Character.toLowerCase(c); // is that safe?
>    if (c<  '\u03A3' || (c<  Character.MIN_HIGH_SURROGATE&&  c != 'u03A3'&&  lc<  Character.MIN_SUPPLEMENTARY_CODE_POINT))) {
>      result[i] = (char)lc;
>    } else {
>      return toLowerCaseEx(result, i, locale, localeDependent);
>    }
>
> FWIW i personally find those solutions easier to read, if they are safe w.r.t. Character.toLowerCase and that annoying greek character.

The existing code has the clear and explicit logic that

(1) j.l.C.toLowerCase() can NOT handle surrogate(s)
(2) j.l.C.toLowerCase() can NOT handle \u03A3
(3) if for some reasons j.l.C.toLowerCase() returns a supplementary character
      for a non-bmp character, the current "one-char" based result can NOT handle
      it,
So in above cases, go to "ex" version.

The suggested approach is kinda of hacky and the logic is not clear.
For example it tries to optimize with c < \u03a3 cases, it works with the assumption
that all lower cases of < \u03a3 characters are never a supplementary characters,
which is true for now and probably will be true forever, but it actually changes the
design and implementation logic (be driven by the backend unicode case mapping
data, not an "assumption").

The

(c<  Character.MIN_HIGH_SURROGATE&&  c != 'u03A3'&&  lc<  Character.MIN_SUPPLEMENTARY_CODE_POINT)

part appears to be fine, but its logic is not as explicit as the
existing one.

Given the measurement indicates it does not bring in measurable
improvement, personally I would prefer to keep the existing one,
if you don't have a very strong opinion on this.

http://cr.openjdk.java.net/~sherman/8032012/

Thanks!
-Sherman





More information about the core-libs-dev mailing list