RFR: 8248655: Support supplementary characters in String case insensitive operations

Roger Riggs Roger.Riggs at oracle.com
Wed Jul 15 17:56:21 UTC 2020


Hi Naoto,

Given the extra tests in the body of the loop, I think it's worth finding
or creating a JMH test for this and checking the performance.

With performance in mind, I would try to fall back to the UC/LC 
conversions only
when the bytes don't match.  See java.util.Arrays.mismatch(byte[], byte[]).

It might even be worth finding the mismatch in the byte arrays before even
starting to look at the characters.
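
A minimal sketch of the mismatch-first idea, not the code in the webrev. Public API does not expose String's internal byte[], so this sketch assumes Arrays.mismatch(char[], char[]) as a stand-in for the byte[] overload. The class and method names are hypothetical.

```java
import java.util.Arrays;

final class MismatchFirstSketch {
    /** Hypothetical helper: skip the identical prefix, then compare case-insensitively. */
    static boolean equalsIgnoreCaseSketch(String a, String b) {
        char[] ca = a.toCharArray();
        char[] cb = b.toCharArray();
        int i = Arrays.mismatch(ca, cb);
        if (i < 0) {
            return true; // identical, no case conversion needed at all
        }
        // The prefix [0, i) is identical. If the mismatch landed on the second
        // half of a surrogate pair, step back to the start of that code point.
        if (i > 0 && Character.isHighSurrogate(ca[i - 1])) {
            i--;
        }
        // Two indexes are kept in this sketch only so that each side advances
        // by the width of its own code point.
        int ia = i;
        int ib = i;
        while (ia < ca.length && ib < cb.length) {
            int cpa = a.codePointAt(ia);
            int cpb = b.codePointAt(ib);
            if (cpa != cpb) {
                // Fall back to the UC/LC conversions only when the raw values differ.
                int ua = Character.toUpperCase(cpa);
                int ub = Character.toUpperCase(cpb);
                if (ua != ub && Character.toLowerCase(ua) != Character.toLowerCase(ub)) {
                    return false;
                }
            }
            ia += Character.charCount(cpa);
            ib += Character.charCount(cpb);
        }
        return ia == ca.length && ib == cb.length;
    }
}
```

For strings that differ only near the end, most of the work is done by the intrinsified mismatch call, and the per-character loop only runs over the tail.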

There's also an option to assemble 4 bytes at a time and compare the ints.
If they are equal you are ahead of the game.  If not, back off to comparing
the characters and checking for surrogates.  The backoff code will be a bit
messier though.
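
A rough sketch of the 4-bytes-at-a-time comparison, assuming plain byte[] inputs and using ByteBuffer to assemble the ints. The class and method names are hypothetical. It only locates the first differing byte; backing off to characters and checking for surrogates would start from the index it returns.

```java
import java.nio.ByteBuffer;

final class WordCompareSketch {
    /**
     * Returns the index of the first differing byte, the shorter length if one
     * array is a proper prefix of the other, or -1 if the arrays are equal.
     */
    static int firstMismatch(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        ByteBuffer wa = ByteBuffer.wrap(a);
        ByteBuffer wb = ByteBuffer.wrap(b);
        int i = 0;
        // Fast path: compare one int (4 bytes) per iteration.
        for (; i + Integer.BYTES <= n; i += Integer.BYTES) {
            if (wa.getInt(i) != wb.getInt(i)) {
                break; // the difference is somewhere in these 4 bytes
            }
        }
        // Back off to single bytes for the differing word and the tail.
        for (; i < n; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return a.length == b.length ? -1 : n;
    }
}
```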

Also, compareToCI and regionMatchesCI could share the implementation of 
the inner loop.

If k1 and k2 ever get out of sync, isn't that a failed assertion? If so,
why have two indexes?

The loop will have fewer checks against the length if it processes len-1
chars and then checks whether there is a final char left to handle.
Inside the loop it can always know there is another char and can blindly
get it.
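
To illustrate the loop shape only, here is a small sketch that counts code points in a char[]. The counting task and the names are hypothetical stand-ins for the real comparison loop; the point is that the main loop stops at len-1, so cs[i + 1] needs no bounds check.

```java
final class LastCharSketch {
    /** Counts code points; unpaired surrogates count as one code point each. */
    static int countCodePoints(char[] cs) {
        int count = 0;
        int i = 0;
        int last = cs.length - 1;
        while (i < last) {
            // i < last guarantees cs[i + 1] exists, so it can be read blindly.
            if (Character.isHighSurrogate(cs[i]) && Character.isLowSurrogate(cs[i + 1])) {
                i += 2;
            } else {
                i++;
            }
            count++;
        }
        // At most one char is left over, and it cannot start a pair.
        if (i == last) {
            count++;
        }
        return count;
    }
}
```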

Regards, Roger


On 7/15/20 12:00 PM, naoto.sato at oracle.com wrote:
> Hello,
>
> Please review the fix to the following issues:
>
> https://bugs.openjdk.java.net/browse/JDK-8248655
> https://bugs.openjdk.java.net/browse/JDK-8248434
>
> The proposed changeset and its CSR are located at:
>
> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
> https://bugs.openjdk.java.net/browse/JDK-8248664
>
> A bug was filed against SimpleDateFormat (8248434) where 
> case-insensitive date format/parse failed in some of the new locales 
> in JDK15. The root cause was that case-insensitive 
> String.regionMatches() method did not work with supplementary 
> characters. The problem is that the method's spec does not expect case 
> mappings of supplementary characters, possibly because they were 
> overlooked in the first place in JSR 204, "Unicode Supplementary 
> Character support". Similar behavior is observed in the other two 
> case-insensitive methods, i.e., compareToIgnoreCase() and 
> equalsIgnoreCase().
>
> The fix is straightforward: compare strings on a code point basis 
> instead of a code unit (16-bit "char") basis. Technically this change 
> will introduce a backward incompatibility, but I believe it is an 
> incompatibility with wrong behavior that was not true to what those 
> methods are expected to do.
>
> Naoto


