RFR: JDK-8032012, , String.toLowerCase/toUpperCase performance improvement

Wed Jan 22 15:20:56 UTC 2014

On Jan 21, 2014, at 11:05 PM, Xueming Shen <xueming.shen at oracle.com> wrote:

> Hi Paul, thanks for reviewing the changeset, comment inlined below.
> 
> On 01/20/2014 09:24 AM, Paul Sandoz wrote:
>> 
>> Some quick comments.
>> 
>> In String.toLowerCase:
>> 
>> - it would be nice to get rid of the pseudo goto using the "scan" labelled block.
>> 
> 
> webrev has been updated to remove the pseudo goto by checking the "first" against
> "len" after the loop break.
> 

Much for readable :-)

>> - there is no need for the "localeDependent" local variable.
>> 
> 
> I just tried to not throw away the result of this "localeDependent", which is still needed
> in the toXCaseEx() methods.
> 

if (lang == "tr" || lang == "az" || lang == "lt") {
  // local dependent
  return toLowerCaseEx(result, firstUpper, locale, true);
}
// otherwise false is passed to subsequent calls to toLowerCaseEx

?

>> - you might be able to optimize by doing (could depend on the answer to the next point):
>> 
>>  int c = (int)value[i];
>>  int lc = Character.toLowerCase(c);
>>  if (.....) { result[i] = (char)lc; } else { return toLowerCaseEx(result, i, locale, localeDependent); }
>> 
>> - Do you need to check ERROR for the result of toLowerCase?
>> 
>> 2586             if (c == Character.ERROR ||
>> 
> 
> Yes, Character.toLowerCase() should never return ERROR (while the package private
> Character.toUpperCaseEx() will). In theory there is no need to check if the return
> value of  Character.toUpperCase(int) > min_supplementary_code_point in our loop,
> because there is no bmp character returns a supplementary code point as its lower
> case. But since it's a data driven mapping table, there is no guarantee the unicode
> data table is not going to change in the "future", so I still keep the check. The webrev
> has been updated to use one single "if" inside the loop.
> 
> I have a "single if" implementation for the toUpperCase() as well, though don't sure
> which one is better/faster :-)
> 

OK, i suppose one could measure :-)

Not sure how much it is worth obsessing over but...

  int c = (int)value[i];
  if (c < '\u03A3') {
    result[i] = (char)Character.toLowerCase(c); // Is that safe?
  } else if (c < Character.MIN_HIGH_SURROGATE && c != 'u03A3' && (c = Character.toLowerCase(c)) < Character.MIN_SUPPLEMENTARY_CODE_POINT)) {
    result[i] = (char)c;  
  } else {
    return toLowerCaseEx(result, i, locale, localeDependent);
  }

or:

  int c = (int)value[i];
  int lc = Character.toLowerCase(c); // is that safe?
  if (c < '\u03A3') {
    result[i] = (char)lc; 
  } else if (c < Character.MIN_HIGH_SURROGATE && c != 'u03A3' && lc < Character.MIN_SUPPLEMENTARY_CODE_POINT)) {
    result[i] = (char)lc; 
  } else {
    return toLowerCaseEx(result, i, locale, localeDependent);
  }

or:

  int c = (int)value[i];
  int lc = Character.toLowerCase(c); // is that safe?
  if (c < '\u03A3' || (c < Character.MIN_HIGH_SURROGATE && c != 'u03A3' && lc < Character.MIN_SUPPLEMENTARY_CODE_POINT))) {
    result[i] = (char)lc;
  } else {
    return toLowerCaseEx(result, i, locale, localeDependent);
  }

FWIW i personally find those solutions easier to read, if they are safe w.r.t. Character.toLowerCase and that annoying greek character.

Paul.

> http://cr.openjdk.java.net/~sherman/8032012/
> 
> -Sherman
>