RFR: 8248655: Support supplementary characters in String case insensitive operations

naoto.sato at oracle.com naoto.sato at oracle.com
Wed Jul 15 18:27:28 UTC 2020


Hi Roger,

Thank you for your review and suggestions.

On 7/15/20 10:56 AM, Roger Riggs wrote:
> Hi Naoto,
> 
> Given the extra tests in the body of the loop, I think its worth finding 
> or creating
> a JMH test for this and checking the performance.

Good point.

> 
> With performance in mind, I would try to fall back to the UC/LC 
> conversions only
> when the bytes don't match.  See java.util.Arrays.mismatch(byte[], byte[]).
> 
> It might even be worth finding the mismatch in the byte arrays before even
> starting to look at the characters.

Agree.

> 
> There's also an option to assemble 4 bytes at a time and compare the int's.
> If they are equal you are ahead of the game.  If not, back off to comparing
> the characters and checking for surrogates.  The backoff code will be a bit
> messier though.

Sure int comparison would be faster, I am not quite sure that would be 
worth compared to more complicated recovery code though.

> 
> Also, compareToCI and regionMatchesCI could share the implementation of 
> the inner loop.

Yes.

> 
> If k1 and k2 ever get out of sync, isn't that failed assertion, so why 
> have two indexes.

This is for preventing possible case maps where one in BMP and the other 
in supplementary, thus the indices could be different.

> 
> The loop will have fewer checks against the length of it processes len-1 
> chars
> and then have a check if there is a final char to be checked.
> it can always know there is another char and can blindly get it.

Maybe, but could be complicated if the last char is a low surrogate.

Anyway, I will come up with a JMH test case and revise the webrev.

Naoto

> 
> Regards, Roger
> 
> 
> On 7/15/20 12:00 PM, naoto.sato at oracle.com wrote:
>> Hello,
>>
>> Please review the fix to the following issues:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8248655
>> https://bugs.openjdk.java.net/browse/JDK-8248434
>>
>> The proposed changeset and its CSR are located at:
>>
>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
>> https://bugs.openjdk.java.net/browse/JDK-8248664
>>
>> A bug was filed against SimpleDateFormat (8248434) where 
>> case-insensitive date format/parse failed in some of the new locales 
>> in JDK15. The root cause was that case-insensitive 
>> String.regionMatches() method did not work with supplementary 
>> characters. The problem is that the method's spec does not expect case 
>> mappings of supplementary characters, possibly because it was 
>> overlooked in the first place, JSR 204 - "Unicode Supplementary 
>> Character support". Similar behavior is observed in other two 
>> case-insensitive methods, i.e., compareToIgnoreCase() and 
>> equalsIgnoreCase().
>>
>> The fix is straightforward to compare strings by code point basis, 
>> instead of code unit (16bit "char") basis. Technically this change 
>> will introduce a backward incompatibility, but I believe it is an 
>> incompatibility to wrong behavior, not true to the meaning of those 
>> methods' expectations.
>>
>> Naoto
> 


More information about the core-libs-dev mailing list