RFR 8253451: Performance regression in java.util.Scanner after 8236201

Tue Sep 22 22:18:49 UTC 2020

Hi Martin,

Overall it seems reasonable to replace the \x{HH} construct with a simpler escape 
sequence. I'm a bit surprised to see the performance impact, but you noticed it, so 
I'll buy that it's significant.

>          // These must be literalized to avoid collision with regex
>          // metacharacters such as dot or parenthesis
> -        groupSeparator =   "\\x{" + Integer.toHexString(dfs.getGroupingSeparator()) + "}";
> -        decimalSeparator = "\\x{" + Integer.toHexString(dfs.getDecimalSeparator()) + "}";
> +        char newSep;
> +        groupSeparator = ((newSep = dfs.getGroupingSeparator()) == ',' ||
> +                newSep == '.' || newSep == ' ' ? "\\" + newSep :
> +                "\\x{" + Integer.toHexString(newSep) + "}" );
> +        decimalSeparator = ((newSep = dfs.getDecimalSeparator()) == '.' ||
> +                newSep == ',' ? "\\" + newSep : "\\x{" +
> +                        Integer.toHexString(newSep) + "}" );

Fundamentally it's simple, but there's rather a lot going on here:

  - reuse of a local variable
  - assignment within a logical expression
  - odd line breaking
  - ternary operator

These factors make the code rather difficult to read, unnecessarily so in my opinion.

(The collections and concurrency libraries sometimes put assignments within other 
expressions, but usually to avoid unnecessary initialization of a local variable. I 
don't think that applies here.)

More to the semantics of the operations, the two cases seem oddly different: the 
group separator tests for ',' and '.' and ' ', but the decimal separator tests for 
'.' and ','. Why are they testing for different things, and in the opposite order?

I looked through the 810 locales available in Oracle JDK 15, and here's what I found:

Decimal Separators:

U+002c 359
U+002e 400
U+066b 51

Grouping Separators:

U+002c 378
U+002e 167
U+00a0 158
U+066c 51
U+2019 12
U+202f 44

I don't see a space used as a gropuing separator, but note that U+00a0 is no-break 
space. Does that occur often enough to special-case?

(Note that these numbers don't take into account how often each locale is used, nor 
are the 810 locales in the JDK the universe of all possible locales. However, it 
seems likely that any other locales would not appear frequently enough to bother 
with a special case.)

I also see that the group separator code tests for ',' first whereas the decimal 
separator code tests for '.' first. I suppose this might be a femto-optimization 
that slightly favors English-speaking locales, but it seems gratuitous to me. (If 
you can come up with a benchmark that illustrates the difference, be my guest!) :-)

I'm wondering if it might be nicer to extract this computation into a little utility 
method nearby:

>     // Escapes a separator character to prevent it from being
>     // interpreted as a regex metacharacter. >     static String escapeSeparator(char ch) {
>         if (ch == ',' || ch == '.') {
>             return "\\" + ch;
>         } else {
>             return "\\x{" + Integer.toHexString(ch) + "}";
>         }
>     }

(Note that I rewrote the comment that already existed above the lines you changed, 
as it seemed misleading.)

The calling code would then be changed to:

>         groupSeparator = escapeSeparator(dfs.getGroupingSeparator());
>         decimalSeparator = escapeSeparator(dfs.getDecimalSeparator());

I haven't benchmarked this. However, this would be called from within useLocale(), 
which is a low-frequency operation compared to the cases you benchmarked.

Also, given that U+00a0 occurs relatively frequently among locales, it might be 
worthwhile also to special-case for it, but I don't have a strong opinion on that. 
This case would likely never occur for the decimal separator, but it doesn't bother 
me. I don't think it adds much value to have different special cases for the 
different separators.

**

Finally, can this be converted into a github PR? It'll have to be made into a PR 
sooner or later in order to be integrated into the JDK mainline. (Though I guess the 
update releases are still using the current email+webrev process for now, sorry.)

s'marks

On 9/21/20 9:48 PM, Martin Balao wrote:
> Hi,
> 
> I'd like to propose a fix for JDK-8253451 [1] performance regression.
> 
> Webrev.00:
> 
>   * http://cr.openjdk.java.net/~mbalao/webrevs/8253451/8253451.webrev.00
> 
> As explained in [1], the idea for the fix is to optimize the regexp
> string for the most common group and decimal separator characters across
> locales. Please note that there are a few locales where the proposed
> optimization does not apply and the slower -but safe- regexp introduced
> by 8236201 is used.
> 
> Testing:
> 
>   * No regressions observed in jdk/java/util/Scanner (5/5 passed)
> 
>   * Verified performance improvement back to what it was before 8236201
> (on my Americas locale).
> 
> Thanks,
> Martin.-
> 
> --
> [1] - https://bugs.openjdk.java.net/browse/JDK-8253451
>