<i18n dev> RFR: 8248655: Support supplementary characters in String case insensitive operations

Fri Jul 17 23:36:00 UTC 2020

Hi,

Based on the suggestions, I modified the fix as follows:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/

Changes from the initial revision are:

- Shared the implementation between compareToCI() and regionMatchesCI()
- Enabled immediate short cut if two code points match.
- Created a simple JMH benchmark. Here is the scores before and after 
the change:

before:
Benchmark                                Mode  Cnt   Score   Error  Units
StringCompareToIgnoreCase.lower          avgt   25  53.764 ? 2.811  ns/op
StringCompareToIgnoreCase.supLower       avgt   25  24.211 ? 1.135  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ? 1.344  ns/op
StringCompareToIgnoreCase.upperLower     avgt   25  18.859 ? 1.499  ns/op

after:
Benchmark                                Mode  Cnt   Score   Error  Units
StringCompareToIgnoreCase.lower          avgt   25  58.354 ? 4.603  ns/op
StringCompareToIgnoreCase.supLower       avgt   25  57.975 ? 5.672  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ? 0.965  ns/op
StringCompareToIgnoreCase.upperLower     avgt   25  17.744 ? 0.272  ns/op

Here, "sup" means all supplementary characters, BMP otherwise. "lower" 
means each character requires upper->lower casemap. "upperLower" means 
all characters are the same, except the last character which requires 
casemap.

I think the result is reasonable, considering surrogates check are now 
mandatory. I have tried Roger's suggestion to use Arrays.mismatch() but 
it did not seem to benefit here. In fact, the performance degraded 
partly because I implemented the short cut, and possibly for the 
overhead of extra checks.

Naoto

On 7/15/20 9:00 AM, naoto.sato at oracle.com wrote:
> Hello,
> 
> Please review the fix to the following issues:
> 
> https://bugs.openjdk.java.net/browse/JDK-8248655
> https://bugs.openjdk.java.net/browse/JDK-8248434
> 
> The proposed changeset and its CSR are located at:
> 
> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
> https://bugs.openjdk.java.net/browse/JDK-8248664
> 
> A bug was filed against SimpleDateFormat (8248434) where 
> case-insensitive date format/parse failed in some of the new locales in 
> JDK15. The root cause was that case-insensitive String.regionMatches() 
> method did not work with supplementary characters. The problem is that 
> the method's spec does not expect case mappings of supplementary 
> characters, possibly because it was overlooked in the first place, JSR 
> 204 - "Unicode Supplementary Character support". Similar behavior is 
> observed in other two case-insensitive methods, i.e., 
> compareToIgnoreCase() and equalsIgnoreCase().
> 
> The fix is straightforward to compare strings by code point basis, 
> instead of code unit (16bit "char") basis. Technically this change will 
> introduce a backward incompatibility, but I believe it is an 
> incompatibility to wrong behavior, not true to the meaning of those 
> methods' expectations.
> 
> Naoto